From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Date: Sat, 10 Feb 2007 02:08:06 -0800 (PST)
Subject: [jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Message-ID: <18598684.1171102086782.JavaMail.jira@brutus>
In-Reply-To: <9563073.1147131621231.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/LUCENE-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-565:
--------------------------------------

Attachment:
LUCENE-565.Feb2007.patch

OK, I moved NewIndexModifier's methods into IndexWriter and did some
small refactoring: tightened up protections, fixed javadocs,
indentation, etc. NewIndexModifier is now removed. I like this
solution much better!

I also increased the default number of deleted terms before a flush is
triggered from 10 to 1000. These buffered terms use very little memory,
so I think it makes sense to have a larger default?

So, this adds these public methods to IndexWriter:

  public void updateDocument(Term term, Document doc, Analyzer analyzer)
  public void updateDocument(Term term, Document doc)
  public synchronized void deleteDocuments(Term[] terms)
  public synchronized void deleteDocuments(Term term)
  public void setMaxBufferedDeleteTerms(int maxBufferedDeleteTerms)
  public int getMaxBufferedDeleteTerms()

And this public field:

  public final static int DEFAULT_MAX_BUFFERED_DELETE_TERMS = 1000;

On the extension points, we had previously added these 4:

  protected void doAfterFlushRamSegments(boolean flushedRamSegments)
  protected boolean timeToFlushRam()
  protected boolean anythingToFlushRam()
  protected boolean onlyRamDocsToFlush()

I would propose that instead we add only the first one above, but
rename it to "doAfterFlush()". This is basically a callback that a
subclass could use to do its own thing after a flush but before a
commit.

But then I don't think we should add any of the others. The
"timeToFlushRam()" callback isn't really needed now that we have a
public "flush()" method. And the other two are very specific to how
IndexWriter implements RAM buffering/flushing, so unless/until we
can think of a use case that needs these I'm inclined to not include
them? Yonik, is there something in Solr that would need these last 2
callbacks?

I've attached the patch (LUCENE-565.Feb2007.patch) with these changes!

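To make the buffered-delete threshold and the proposed "doAfterFlush()"
callback concrete, here is a minimal sketch. It is a toy model, not the
code in the patch: the class name BufferedDeleteSketch, the use of plain
strings instead of Term, and the getFlushCount()/pendingDeleteTerms()
helpers are all assumptions made for illustration. There is no commit
step in this model, so the hook simply runs at the end of a flush.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: not Lucene code. Delete terms are plain strings here. */
class BufferedDeleteSketch {
    private final List<String> bufferedDeleteTerms = new ArrayList<>();
    // Default taken from the message above (raised from 10 to 1000).
    private int maxBufferedDeleteTerms = 1000;
    private int flushCount = 0;

    public synchronized void setMaxBufferedDeleteTerms(int max) {
        if (max < 1) {
            throw new IllegalArgumentException("max must be at least 1");
        }
        this.maxBufferedDeleteTerms = max;
    }

    public synchronized int getMaxBufferedDeleteTerms() {
        return maxBufferedDeleteTerms;
    }

    /** Buffers the term; a flush is triggered once the threshold is reached. */
    public synchronized void deleteDocuments(String term) {
        bufferedDeleteTerms.add(term);
        if (bufferedDeleteTerms.size() >= maxBufferedDeleteTerms) {
            flush();
        }
    }

    /** A real writer would apply the buffered deletes here; the toy just drops them. */
    public synchronized void flush() {
        bufferedDeleteTerms.clear();
        flushCount++;
        doAfterFlush();
    }

    /** Extension point for subclasses; does nothing by default. */
    protected void doAfterFlush() {
    }

    // Hypothetical inspection helpers, present only so the sketch can be checked.
    public synchronized int getFlushCount() {
        return flushCount;
    }

    public synchronized int pendingDeleteTerms() {
        return bufferedDeleteTerms.size();
    }
}
```

A subclass would override doAfterFlush() to run its own bookkeeping each
time the buffer is flushed; with the threshold set to 3, the third
buffered term triggers exactly one flush and one callback.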
> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-565
>                 URL: https://issues.apache.org/jira/browse/LUCENE-565
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Ning Li
>         Assigned To: Michael McCandless
>             Fix For: 2.1
>
>         Attachments: LUCENE-565.Feb2007.patch, NewIndexModifier.Jan2007.patch, NewIndexModifier.Jan2007.take2.patch, NewIndexModifier.Jan2007.take3.patch, NewIndexModifier.Sept21.patch, perf-test-res.JPG, perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind of change has come up in discussions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> -----------
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
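One way to picture inserts and deletes interleaved on a single writer,
with an update built from the two, is the toy model below. It is a
sketch under stated assumptions, not the attached implementation: the
class InterleavedWriterSketch, its string keys and bodies, and the
bodiesFor() helper are hypothetical, and recording the document count
at delete time is just one possible way to keep a buffered delete from
removing documents that were added after it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: not Lucene code. A "document" is a key plus a body string. */
class InterleavedWriterSketch {
    private static final class Doc {
        final String key;
        final String body;

        Doc(String key, String body) {
            this.key = key;
            this.body = body;
        }
    }

    // Documents in insertion order.
    private final List<Doc> docs = new ArrayList<>();
    // Buffered deletes: key -> number of documents present when the delete was requested.
    private final Map<String, Integer> pendingDeletes = new LinkedHashMap<>();

    public synchronized void addDocument(String key, String body) {
        docs.add(new Doc(key, body));
    }

    /** Buffers the delete; it covers only documents added before this call. */
    public synchronized void deleteDocuments(String key) {
        pendingDeletes.put(key, docs.size());
    }

    /** Delete and insert as one synchronized step, so no other call can slip in between. */
    public synchronized void updateDocument(String key, String body) {
        deleteDocuments(key);
        addDocument(key, body);
    }

    /** Applies the buffered deletes, keeping documents added after the matching delete. */
    public synchronized void flush() {
        List<Doc> kept = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Doc d = docs.get(i);
            Integer limit = pendingDeletes.get(d.key);
            if (limit == null || i >= limit) {
                kept.add(d);
            }
        }
        docs.clear();
        docs.addAll(kept);
        pendingDeletes.clear();
    }

    /** Hypothetical helper: bodies currently stored under a key (deletes apply at flush). */
    public synchronized List<String> bodiesFor(String key) {
        List<String> out = new ArrayList<>();
        for (Doc d : docs) {
            if (d.key.equals(key)) {
                out.add(d.body);
            }
        }
        return out;
    }
}
```

In this model, adding a document, updating it, and then flushing leaves
only the updated version, and a delete followed by a fresh insert under
the same key keeps the fresh insert.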
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --------------
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document is properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are marked with a "CHANGE" comment. We have also attached a
> modified version of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> -------------------
> To test the performance of our proposed changes, we ran some experiments
> using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz
> Intel Xeon server running Linux. The disk storage was configured as a RAID0
> array with 5 drives. Before indexes were built, the input documents were
> parsed to remove the HTML from them (i.e., only the text was indexed). This
> was done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
>     index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
>     inserted, but 25% were deleted. 1000 documents were
>     deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
>     were deleted for every 20 inserted.
>                                 current       current          new
> Workload                        IndexWriter   IndexModifier    IndexWriter
> -----------------------------------------------------------------------
> Insert only                     116 min       119 min          116 min
> Insert/delete (big batches)     --            135 min          125 min
> Insert/delete (small batches)   --            338 min          134 min
> As the experiments show, with the proposed changes, indexing time dropped
> by about 60% (from 338 min to 134 min) when inserts and deletes were
> interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120
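The 60% figure in the quoted results can be checked from the table by
treating it as the fraction of running time saved relative to the
current IndexModifier. The helper below is a hypothetical illustration
(the class and method names are not from the issue); it only redoes the
arithmetic on the numbers in the table.

```java
/** Arithmetic check on the quoted timings; names here are illustrative only. */
class RelativeImprovement {
    /** Fraction of running time saved when going from oldMinutes to newMinutes. */
    static double timeSaved(double oldMinutes, double newMinutes) {
        if (oldMinutes <= 0) {
            throw new IllegalArgumentException("oldMinutes must be positive");
        }
        return (oldMinutes - newMinutes) / oldMinutes;
    }
}
```

With the table's values, timeSaved(338, 134) is 204/338, or roughly
0.60, for small batches, and timeSaved(135, 125) is 10/135, or roughly
0.07, for big batches.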