Subject: Efficiently updating indexed documents
To: java-user@lucene.apache.org
From: "Nadav Har'El" <nyh@il.ibm.com>
Date: Tue, 28 Feb 2006 10:49:08 +0200

A few days ago someone on this list asked how to efficiently "update" documents in the index, i.e., delete the old version of the document (found by some unique id field) and add the new version. The problem was that opening and closing the IndexReader and IndexWriter after each document is inefficient (using IndexModifier doesn't help here, because it does the same thing behind the scenes). I was interested in doing the same thing myself.

People suggested doing the deletes immediately and buffering the document additions in memory for later. This is doable, but I wanted to avoid buffering the new documents (which are potentially large) in memory myself, and instead let Lucene do whatever buffering it wishes inside IndexWriter. I also did not like the idea that for some period of time searches would not return the updated document, because the old version was already deleted and the new version was not yet indexed.
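(For concreteness, the per-document update we are trying to avoid looks roughly like the sketch below. This is only an illustration, assuming the current API where IndexReader has deleteDocuments(Term); "indexDir" is a placeholder for wherever your index lives.)

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    class NaiveUpdater {
        // Naive update: open and close both a reader and a writer for
        // every single replaced document.
        static void updateNaively(String indexDir, Document doc,
                String idfield, Analyzer analyzer) throws IOException {
            // Delete the old version, found by its unique id term.
            IndexReader ir = IndexReader.open(indexDir);
            ir.deleteDocuments(new Term(idfield, doc.get(idfield)));
            ir.close(); // closing commits the deletes
            // Then open a writer just to add the new version.
            IndexWriter iw = new IndexWriter(indexDir, analyzer, false);
            iw.addDocument(doc, analyzer);
            iw.close();
        }
    }

Every replaced document pays the full cost of opening, committing and closing both a reader and a writer, which is what makes this approach so slow.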
I therefore came up with the following solution, and I'll be happy to hear comments about it (especially if you think this solution is broken in some way, or that my assumptions are wrong).

The idea is basically this: when I want to replace a document, I immediately add the new document (with IndexWriter.addDocument) to the open IndexWriter. I also save the document's unique id term in a vector, "idsReplaced", of terms we will deal with later:

    private Vector idsReplaced = new Vector();

    public void replaceDocument(Document document, String idfield,
            Analyzer analyzer) throws IOException {
        indexwriter.addDocument(document, analyzer);
        // Remember the unique id term of the document we just replaced.
        idsReplaced.add(new Term(idfield, document.get(idfield)));
    }

Now, when I want to flush the index, I close the IndexWriter to make sure all the new documents have been added, and then, for each id in the idsReplaced vector, I remove all but the last document with that id. The trick here is that IndexReader.termDocs(term) returns the matching documents ordered by internal document number, and documents added later get a higher number (I hope this is actually true... it seems to be the case in my experiments), so we can delete all but the last matching document for the same id. The code looks something like this:

    // Call this after doing indexwriter.close();
    private void doDelete() throws IOException {
        if (idsReplaced.isEmpty())
            return;
        IndexReader ir = IndexReader.open(indexDir);
        for (Iterator i = idsReplaced.iterator(); i.hasNext();) {
            Term term = (Term) i.next();
            TermDocs docs = ir.termDocs(term);
            // Walk the matches in document-number order, deleting every
            // one except the last (i.e., most recently added) version.
            int doctodelete = -1;
            while (docs.next()) {
                if (doctodelete >= 0) // >= 0: document 0 is a valid doc number
                    ir.deleteDocument(doctodelete);
                doctodelete = docs.doc();
            }
            docs.close();
        }
        idsReplaced.clear();
        ir.close();
    }

I did not test this idea much yet, but in some initial experiments it seems to work.

--
Nadav Har'El
nyh@il.ibm.com
+972-4-829-6326

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org