Message-ID: <18735318.1165593086147.JavaMail.jira@brutus>
Date: Fri, 8 Dec 2006 07:51:26 -0800 (PST)
From: "Michael Busch (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
In-Reply-To: <9563073.1147131621231.JavaMail.jira@brutus>

    [ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12456887 ]

Michael Busch commented on LUCENE-565:
--------------------------------------

What are the reasons to not add the NewIndexModifier to Lucene?
This issue already has 6 votes, so it seems to be very popular amongst
users (only one issue has more votes). I have been using it for a couple
of months now; it works flawlessly and has made my life a lot easier ;-)

I think the main objection was that the earliest versions of this patch
made too many changes to IndexWriter. With the new merge policy committed,
most of the new code is now in the new class NewIndexModifier, and the
changes to IndexWriter are minimal. So I would like to encourage the
committer(s) to take another look. I think this would be a nice feature
for the next Lucene release.

> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-565
>                 URL: http://issues.apache.org/jira/browse/LUCENE-565
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Ning Li
>         Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, KeepDocCount0Segment.Sept15.patch, NewIndexModifier.July09.patch, NewIndexModifier.Sept21.patch, NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG, perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem.
> We are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> -----------
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change, it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --------------
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document is properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are marked with the comment "CHANGE". We have also attached a
> modified version of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> -------------------
> To test the performance of our proposed changes, we ran some experiments
> using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz
> Intel Xeon server running Linux. The disk storage was configured as a RAID0
> array with 5 drives.
> Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
>     index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
>     inserted, but 25% were deleted. 1000 documents were
>     deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
>     were deleted for every 20 inserted.
>                                 current       current          new
> Workload                        IndexWriter   IndexModifier    IndexWriter
> -----------------------------------------------------------------------
> Insert only                     116 min       119 min          116 min
> Insert/delete (big batches)     --            135 min          125 min
> Insert/delete (small batches)   --            338 min          134 min
> As the experiments show, with the proposed changes, indexing time dropped
> by about 60% (from 338 min to 134 min) when inserts and deletes were
> interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
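
For readers who want a concrete picture of the buffering scheme described
under "Coding Changes" in the quoted issue, here is a minimal sketch in
Python. It is not the attached patch and does not use Lucene: the class
BufferingWriter, its methods, and the max_buffered parameter are all
hypothetical names made up for illustration. It only assumes what the
description states, namely that delete terms are buffered, applied when the
in-memory buffer is flushed, and ordered so that a document re-inserted
after a delete survives.

```python
class BufferingWriter:
    """Toy stand-in (hypothetical) for a writer that buffers adds and deletes."""

    def __init__(self, max_buffered=4):
        self.on_disk = []          # documents already flushed (dicts of field -> value)
        self.ram_docs = []         # documents added since the last flush
        # (field, value) -> number of buffered docs at the time of the delete
        self.pending_deletes = {}
        self.max_buffered = max_buffered

    def add_document(self, doc):
        self.ram_docs.append(dict(doc))
        if len(self.ram_docs) >= self.max_buffered:
            self.flush()           # buffer is "full"

    def delete_documents(self, field, value):
        # Defer the delete. Recording the current buffer length lets flush()
        # spare documents with the same term that are added later.
        self.pending_deletes[(field, value)] = len(self.ram_docs)

    def flush(self):
        terms = self.pending_deletes
        # Buffered deletes always apply to documents flushed earlier.
        self.on_disk = [d for d in self.on_disk
                        if not any(d.get(f) == v for (f, v) in terms)]
        # For buffered documents, a delete covers only docs added before it.
        for i, d in enumerate(self.ram_docs):
            if not any(d.get(f) == v and i < n for (f, v), n in terms.items()):
                self.on_disk.append(d)
        self.ram_docs = []
        self.pending_deletes = {}

    def close(self):
        self.flush()


# Interleave inserts and a delete on one writer, with no reader involved.
writer = BufferingWriter(max_buffered=10)
writer.add_document({"id": "1", "body": "old"})
writer.add_document({"id": "2", "body": "other"})
writer.delete_documents("id", "1")               # removes the "old" version only
writer.add_document({"id": "1", "body": "new"})  # delete + insert acts as an update
writer.close()
print(writer.on_disk)
```

The sketch ignores segments, merging, and thread safety, which the issue
says the real change handles with Java synchronization.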