Date: Thu, 12 Jul 2012 20:53:30 +0200
Subject: Re: delete by docid in lucene 4
From: Simon Willnauer
Reply-To: simon.willnauer@gmail.com
To: java-user@lucene.apache.org

On Thu, Jul 12, 2012 at 6:55 PM, Sean Bridges wrote:
> Thanks for the tip.
>
> Does using updateDocument instead of addDocument affect
> indexing/search performance?

It does affect indexing performance compared to addDocument, but that
might be minor compared to your analysis chain. I wouldn't worry about
updateDocument; it's really the only sensible way to use Lucene. Why
didn't you use it before, any reason? What is your ingest rate / doc
throughput, and at what point would you get concerned?

simon

>
> Sean
>
> On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler wrote:
>> The trick is to index not with addDocument(Document) but instead with
>> updateDocument(Term, Document). Lucene then adds the document atomically
>> while deleting any previous documents with the given term (which is your
>> unique ID). If the key does not exist it simply indexes without deleting
>> anything.
>> By this you always have only one document with the same Term (==your
>> unique ID).
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: Sean Bridges [mailto:sean.bridges@gmail.com]
>>> Sent: Thursday, July 12, 2012 5:42 PM
>>> To: java-user@lucene.apache.org; simon.willnauer@gmail.com
>>> Subject: Re: delete by docid in lucene 4
>>>
>>> We have indexer machines which are fed documents by other machines.
>>> If an error occurs (machine crashing etc.) the same document may be
>>> sent to an indexer multiple times. Serial ids are assigned before
>>> documents reach the indexer, so a document may be in the index
>>> multiple times, each time with the same serial id.
>>>
>>> When the index gets large enough, the indexer will stop writing to
>>> the index, and upload it to another machine, which keeps the index
>>> forever. Before we upload the index, we forceMerge(1) on it, and
>>> gather some stats about the index like max, min serial id, total
>>> documents. While calculating max and min serial id, if we see a
>>> duplicate serial id, we call IndexReader.deleteByDocId(...).
>>>
>>> We could check for duplicate serial ids while indexing, but that is
>>> racy, and not as efficient.
>>>
>>> Thanks,
>>>
>>> Sean
>>>
>>>
>>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer wrote:
>>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges wrote:
>>> >> Is it possible to delete by docId in lucene 4? I can delete by docid
>>> >> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>>> >> method is gone in lucene 4, and IndexWriter only allows deleting by
>>> >> Term or Query.
>>> >
>>> > that is correct. In lucene 4 IndexReader is really just a reader!
>>> >>
>>> >> This is our use case - In our system, each document is identified by
>>> >> a unique serial id. If an error occurs, we may index the same
>>> >> message multiple times.
>>> >> When an index grows large enough, we stop
>>> >> adding to it, and optimize the index. During optimization, if we see
>>> >> multiple docs with the same serialid, we delete all but the first, as
>>> >> all documents with the same serialid are the same.
>>> >
>>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
>>> > method? Do you rely on multiple versions of the same doc? With Lucene
>>> > 4 relying on the doc id can become very tricky. If you use multiple
>>> > threads you create a lot of segments which can be merged in any order.
>>> > You can't tell if a document ID maintains happened-before semantics at
>>> > all.
>>> >
>>> > Can you tell us more about your use case and why you are using
>>> > deleteByDocID?
>>> >
>>> > simon
>>> >
>>> >
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Sean
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
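
The difference between addDocument(Document) and updateDocument(Term,
Document) discussed in this thread can be sketched with a toy in-memory
model. This is not Lucene code: the ToyIndex class and its methods are
hypothetical stand-ins, and the sketch assumes the serial id plays the
role of the unique Term. It only illustrates why re-sent documents pile
up with add but not with update.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory stand-in for an index. NOT Lucene; all names are hypothetical.
class ToyIndex {
    // One entry per stored document: {serialId, body}.
    private final List<String[]> docs = new ArrayList<>();

    // Models addDocument(Document): always appends, so a document that is
    // sent twice ends up in the index twice.
    void addDocument(String serialId, String body) {
        docs.add(new String[] {serialId, body});
    }

    // Models updateDocument(Term, Document): remove every document that
    // carries the key, then append the new one. Per the thread, Lucene does
    // this atomically; this toy is single-threaded and makes no such promise.
    void updateDocument(String serialId, String body) {
        docs.removeIf(d -> d[0].equals(serialId));
        docs.add(new String[] {serialId, body});
    }

    // Number of stored documents that carry the given serial id.
    int count(String serialId) {
        int n = 0;
        for (String[] d : docs) {
            if (d[0].equals(serialId)) {
                n++;
            }
        }
        return n;
    }

    // Total number of stored documents.
    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex added = new ToyIndex();
        added.addDocument("1001", "hello");
        added.addDocument("1001", "hello"); // same document re-sent after a crash
        System.out.println("addDocument copies: " + added.count("1001"));

        ToyIndex updated = new ToyIndex();
        updated.updateDocument("1001", "hello");
        updated.updateDocument("1001", "hello"); // same document re-sent after a crash
        System.out.println("updateDocument copies: " + updated.count("1001"));
    }
}
```

With update-by-key there is nothing left to clean up after forceMerge(1),
so the doc-id based delete pass described earlier in the thread would
not be needed.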