Message-Id: <092330F8-18AA-45B2-BC7F-42245812855E@ix.netcom.com>
From: robert engels
Subject: Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Date: Thu, 6 Jul 2006 03:12:26 -0500
To: java-dev@lucene.apache.org

I don't like "mucking up" JIRA with "commentary". I thought emails were
more appropriate, with JIRA then updated with the more pertinent info.

Anyway, my test did exercise the small batches, in that in our
incremental updates we delete the documents with the unique term and
then add the new ones (which is what I assumed this was improving), and
I saw no appreciable difference.

I think a design overview for something as involved as this would be
very beneficial. I know the submitter references a previous bug/email,
but the provided implementation doesn't seem to match up with that, at
least as far as I could tell.

It appears that the performance gain may only be realized when the
newly submitted documents were previously submitted within the same
update?

Maybe a test case that demonstrates the performance improvements?

On Jul 6, 2006, at 3:03 AM, Otis Gospodnetic wrote:

> Robert, it's better to put your comments in JIRA, where Ning Li is
> more likely to see them.
>
> As for performance, it looks like the biggest gain is when one has
> small interleaving add/delete batches. It sounds like your app
> doesn't have that and has fewer larger add/delete batches.
>
> I do agree about the complexity there. I couldn't follow
> everything either, but saw nothing wrong. More comments would
> certainly help.
>
> Otis
>
> ----- Original Message ----
> From: robert engels
> To: java-dev@lucene.apache.org
> Sent: Thursday, July 6, 2006 3:20:02 AM
> Subject: Re: [jira] Commented: (LUCENE-565) Supporting
> deleteDocuments in IndexWriter (Code and Performance Results Provided)
>
> I applied the patch, and made code changes to use it. It did not make
> any appreciable difference in performance over our current code
> (delete using IndexReader and then update the documents using
> IndexWriter - each document has a unique "key").
>
> I attempted to evaluate the code on its own, but must admit that I
> got "lost" a bit.
>
> Maybe if the submitter could provide a "design overview" of why this
> is more efficient, and in what cases it is (and possible degradation
> in others) it would be easier to evaluate.
>
>
> On Jul 5, 2006, at 10:25 PM, Otis Gospodnetic (JIRA) wrote:
>
>> [ http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12419396 ]
>>
>> Otis Gospodnetic commented on LUCENE-565:
>> -----------------------------------------
>>
>> I took a look at the patch and it looks good to me (anyone else had
>> a look)?
>> Unfortunately, I couldn't get the patch to apply :(
>>
>> $ patch -F3 < IndexWriter.patch
>> (Stripping trailing CRs from patch.)
>> patching file IndexWriter.java
>> Hunk #1 succeeded at 58 with fuzz 1.
>> Hunk #2 succeeded at 112 (offset 2 lines).
>> Hunk #4 succeeded at 504 (offset 33 lines).
>> Hunk #6 succeeded at 605 with fuzz 2 (offset 57 lines).
>> missing header for unified diff at line 259 of patch
>> (Stripping trailing CRs from patch.)
>> can't find file to patch at input line 259
>> Perhaps you should have used the -p or --strip option?
>> The text leading up to this was:
>> ...
>> ...
>> ...
>> File to patch: IndexWriter.java
>> patching file IndexWriter.java
>> Hunk #1 FAILED at 802.
>> Hunk #2 succeeded at 745 with fuzz 2 (offset -131 lines).
>> 1 out of 2 hunks FAILED -- saving rejects to file
>> IndexWriter.java.rej
>>
>>
>> Would it be possible for you to regenerate the patch against
>> IndexWriter in HEAD?
>>
>> Also, I noticed ^Ms in the patch, but I can take care of those
>> easily (dos2unix).
>>
>> Finally, I noticed in 2-3 places that the simple logging via the
>> "infoStream" variable was removed, for example:
>> - if (infoStream != null) infoStream.print("merging segments");
>>
>> Perhaps this was just an oversight?
>>
>> Looking forward to the new patch. Thanks!
>>
>>> Supporting deleteDocuments in IndexWriter (Code and Performance
>>> Results Provided)
>>> ---------------------------------------------------------------------
>>>
>>> Key: LUCENE-565
>>> URL: http://issues.apache.org/jira/browse/LUCENE-565
>>> Project: Lucene - Java
>>> Type: Bug
>>
>>> Components: Index
>>> Reporter: Ning Li
>>> Attachments: IndexWriter.java, IndexWriter.patch,
>>> TestWriterDelete.java
>>>
>>> Today, applications have to open/close an IndexWriter and open/close
>>> an IndexReader directly or indirectly (via IndexModifier) in order to
>>> handle a mix of inserts and deletes. This performs well when inserts
>>> and deletes come in fairly large batches. However, the performance
>>> can degrade dramatically when inserts and deletes are interleaved in
>>> small batches. This is because the ramDirectory is flushed to disk
>>> whenever an IndexWriter is closed, causing a lot of small segments to
>>> be created on disk, which eventually need to be merged.
>>> We would like to propose a small API change to eliminate this
>>> problem. We are aware that this kind of change has come up in
>>> discussions before. See
>>> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
>>> . The difference this time is that we have implemented the change
>>> and tested its performance, as described below.
>>> API Changes
>>> -----------
>>> We propose adding a "deleteDocuments(Term term)" method to
>>> IndexWriter. Using this method, inserts and deletes can be
>>> interleaved using the same IndexWriter.
>>> Note that, with this change, it would be very easy to add another
>>> method to IndexWriter for updating documents, allowing applications
>>> to avoid a separate delete and insert to update a document.
>>> Also note that this change can co-exist with the existing APIs for
>>> deleting documents using an IndexReader. But if our proposal is
>>> accepted, we think those APIs should probably be deprecated.
>>> Coding Changes
>>> --------------
>>> Coding changes are localized to IndexWriter. Internally, the new
>>> deleteDocuments() method works by buffering the terms to be deleted.
>>> Deletes are deferred until the ramDirectory is flushed to disk,
>>> either because it becomes full or because the IndexWriter is closed.
>>> Using Java synchronization, care is taken to ensure that an
>>> interleaved sequence of inserts and deletes for the same document is
>>> properly serialized.
>>> We have attached a modified version of IndexWriter in Release 1.9.1
>>> with these changes. Only a few hundred lines of coding changes are
>>> needed. All changes are marked with a "CHANGE" comment. We have also
>>> attached a modified version of an example from Chapter 2.2 of Lucene
>>> in Action.
>>> Performance Results
>>> -------------------
>>> To test the performance of our proposed changes, we ran some
>>> experiments using the TREC WT 10G dataset. The experiments were run
>>> on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage
>>> was configured as a RAID0 array with 5 drives. Before indexes were
>>> built, the input documents were parsed to remove the HTML from them
>>> (i.e., only the text was indexed). This was
>>> done to minimize the impact of parsing on performance.
>>> A simple WhitespaceAnalyzer was used during index build.
>>> We experimented with three workloads:
>>> - Insert only. 1.6M documents were inserted and the final
>>>   index size was 2.3GB.
>>> - Insert/delete (big batches). The same documents were
>>>   inserted, but 25% were deleted. 1000 documents were
>>>   deleted for every 4000 inserted.
>>> - Insert/delete (small batches). In this case, 5 documents
>>>   were deleted for every 20 inserted.
>>>
>>>                                 current      current        new
>>> Workload                        IndexWriter  IndexModifier  IndexWriter
>>> ---------------------------------------------------------------------
>>> Insert only                     116 min      119 min        116 min
>>> Insert/delete (big batches)     --           135 min        125 min
>>> Insert/delete (small batches)   --           338 min        134 min
>>>
>>> As the experiments show, with the proposed changes, indexing time
>>> dropped by about 60% (from 338 min to 134 min) when inserts and
>>> deletes were interleaved in small batches.
>>> Regards,
>>> Ning
>>> Ning Li
>>> Search Technologies
>>> IBM Almaden Research Center
>>> 650 Harry Road
>>> San Jose, CA 95120
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> If you think it was sent incorrectly contact one of the
>> administrators:
>> http://issues.apache.org/jira/secure/Administrators.jspa
>> -
>> For more information on JIRA, see:
>> http://www.atlassian.com/software/jira
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
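
For anyone trying to follow the "Coding Changes" summary quoted above
(delete terms are buffered and only applied when the ramDirectory is
flushed), here is a rough, self-contained sketch of that idea. It is
not the LUCENE-565 patch and not Lucene's API. The class
BufferedDeleteWriter, its in-memory lists standing in for segments, and
the rule of remembering the buffer size at delete time are all
assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an index writer; NOT Lucene's IndexWriter. */
class BufferedDeleteWriter {
    /** A document reduced to a unique key term and a body (illustrative only). */
    static final class Doc {
        final String key;
        final String body;
        Doc(String key, String body) { this.key = key; this.body = body; }
    }

    private final List<Doc> flushed = new ArrayList<>();  // stands in for on-disk segments
    private final List<Doc> buffered = new ArrayList<>(); // stands in for the ramDirectory
    // Key term -> number of buffered docs that existed when the delete was issued.
    private final Map<String, Integer> pendingDeletes = new HashMap<>();
    private final int maxBufferedDocs;
    private int flushCount = 0;

    BufferedDeleteWriter(int maxBufferedDocs) { this.maxBufferedDocs = maxBufferedDocs; }

    synchronized void addDocument(String key, String body) {
        buffered.add(new Doc(key, body));
        if (buffered.size() >= maxBufferedDocs) flush(); // buffer "full"
    }

    synchronized void deleteDocuments(String key) {
        // Assumption: a delete covers everything added so far, so remember the
        // current buffer size. Docs with this key added later must survive.
        pendingDeletes.put(key, buffered.size());
    }

    synchronized void flush() {
        // Apply buffered deletes to everything that was already flushed.
        flushed.removeIf(d -> pendingDeletes.containsKey(d.key));
        // Keep a buffered doc unless a delete for its key was issued after it was added.
        for (int i = 0; i < buffered.size(); i++) {
            Doc d = buffered.get(i);
            Integer limit = pendingDeletes.get(d.key);
            if (limit == null || i >= limit) flushed.add(d);
        }
        buffered.clear();
        pendingDeletes.clear();
        flushCount++;
    }

    synchronized List<String> liveBodies() {
        List<String> out = new ArrayList<>();
        for (Doc d : flushed) out.add(d.body);
        return out;
    }

    synchronized int flushCount() { return flushCount; }
}

class BufferedDeleteDemo {
    public static void main(String[] args) {
        BufferedDeleteWriter w = new BufferedDeleteWriter(100);
        w.addDocument("k1", "v1");
        w.flush();                  // v1 is now "on disk"
        w.deleteDocuments("k1");    // update k1: delete, then re-add
        w.addDocument("k1", "v2");
        w.addDocument("k2", "tmp");
        w.deleteDocuments("k2");    // issued after the add, so tmp is dropped
        w.flush();
        System.out.println(w.liveBodies()); // prints [v2]
    }
}
```

The point of the sketch is only that a delete-then-add update costs no
extra flush: both operations sit in memory and are resolved in one pass
when the buffer is written out, which is where the quoted numbers
(338 min down to 134 min for small batches) would come from.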