lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ning Li (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Date Fri, 08 Sep 2006 17:56:30 GMT
     [ http://issues.apache.org/jira/browse/LUCENE-565?page=all ]

Ning Li updated LUCENE-565:
---------------------------

    Attachment: newMergePolicy.Sept08.patch

This patch features the new more robust merge policy. Reference on the new policy is at http://www.gossamer-threads.com/lists/lucene/java-dev/35147
  - The patch passes all the tests except that one in TestIndexModifier (see an earlier comment
on this issue).
  - Since the test itself has a problem, it is fixed (one line change) and the patch passes
the fixed test.
  - A new test call TestIndexWriterMergePolicy is included which shows the robustness of the
new merge policy.


The following is a detailed description of the new merge policy and its properties.

 Overview of merge policy:

 A flush is triggered either by close() or by the number of ram segments
 reaching maxBufferedDocs. After a disk segment is created by the flush,
 further merges may be triggered.

 LowerBound and upperBound set the limits on the doc count of a segment
 which may be merged. Initially, lowerBound is set to 0 and upperBound
 to maxBufferedDocs. Starting from the rightmost* segment whose doc count
 > lowerBound and <= upperBound, count the number of consecutive segments
 whose doc count <= upperBound.

 Case 1: number of worthy segments < mergeFactor, no merge, done.
 Case 2: number of worthy segments == mergeFactor, merge these segments.
         If the doc count of the merged segment <= upperBound, done.
         Otherwise, set lowerBound to upperBound, and multiply upperBound
         by mergeFactor, go through the process again.
 Case 3: number of worthy segments > mergeFactor (in the case mergeFactor
         M changes), merge the leftmost* M segments. If the doc count of
         the merged segment <= upperBound, consider the merged segment for
         further merges on this same level. Merge the now leftmost* M
         segments, and so on, until number of worthy segments < mergeFactor.
         If the doc count of all the merged segments <= upperBound, done.
         Otherwise, set lowerBound to upperBound, and multiply upperBound
         by mergeFactor, go through the process again.
 Note that case 2 can be considerd as a special case of case 3.

 This merge policy guarantees two invariants if M does not change and
 segment doc count is not reaching maxMergeDocs:
 B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B)))
      1: If i (left*) and i+1 (right*) are two consecutive segments of doc
         counts x and y, then f(x) >= f(y).
      2: The number of committed segments on the same level (f(n)) <= M.


> Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-565
>                 URL: http://issues.apache.org/jira/browse/LUCENE-565
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Ning Li
>         Attachments: IndexWriter.java, IndexWriter.July09.patch, IndexWriter.patch, NewIndexModifier.July09.patch,
NewIndexWriter.Aug23.patch, NewIndexWriter.July18.patch, newMergePolicy.Sept08.patch, perf-test-res.JPG,
perf-test-res2.JPG, perfres.log, TestBufferedDeletesPerf.java, TestWriterDelete.java
>
>
> Today, applications have to open/close an IndexWriter and open/close an
> IndexReader directly or indirectly (via IndexModifier) in order to handle a
> mix of inserts and deletes. This performs well when inserts and deletes
> come in fairly large batches. However, the performance can degrade
> dramatically when inserts and deletes are interleaved in small batches.
> This is because the ramDirectory is flushed to disk whenever an IndexWriter
> is closed, causing a lot of small segments to be created on disk, which
> eventually need to be merged.
> We would like to propose a small API change to eliminate this problem. We
> are aware that this kind change has come up in discusions before. See
> http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049
> . The difference this time is that we have implemented the change and
> tested its performance, as described below.
> API Changes
> -----------
> We propose adding a "deleteDocuments(Term term)" method to IndexWriter.
> Using this method, inserts and deletes can be interleaved using the same
> IndexWriter.
> Note that, with this change it would be very easy to add another method to
> IndexWriter for updating documents, allowing applications to avoid a
> separate delete and insert to update a document.
> Also note that this change can co-exist with the existing APIs for deleting
> documents using an IndexReader. But if our proposal is accepted, we think
> those APIs should probably be deprecated.
> Coding Changes
> --------------
> Coding changes are localized to IndexWriter. Internally, the new
> deleteDocuments() method works by buffering the terms to be deleted.
> Deletes are deferred until the ramDirectory is flushed to disk, either
> because it becomes full or because the IndexWriter is closed. Using Java
> synchronization, care is taken to ensure that an interleaved sequence of
> inserts and deletes for the same document are properly serialized.
> We have attached a modified version of IndexWriter in Release 1.9.1 with
> these changes. Only a few hundred lines of coding changes are needed. All
> changes are commented by "CHANGE". We have also attached a modified version
> of an example from Chapter 2.2 of Lucene in Action.
> Performance Results
> -------------------
> To test the performance our proposed changes, we ran some experiments using
> the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel
> Xeon server running Linux. The disk storage was configured as RAID0 array
> with 5 drives. Before indexes were built, the input documents were parsed
> to remove the HTML from them (i.e., only the text was indexed). This was
> done to minimize the impact of parsing on performance. A simple
> WhitespaceAnalyzer was used during index build.
> We experimented with three workloads:
>   - Insert only. 1.6M documents were inserted and the final
>     index size was 2.3GB.
>   - Insert/delete (big batches). The same documents were
>     inserted, but 25% were deleted. 1000 documents were
>     deleted for every 4000 inserted.
>   - Insert/delete (small batches). In this case, 5 documents
>     were deleted for every 20 inserted.
>                                 current       current          new
> Workload                      IndexWriter  IndexModifier   IndexWriter
> -----------------------------------------------------------------------
> Insert only                     116 min       119 min        116 min
> Insert/delete (big batches)       --          135 min        125 min
> Insert/delete (small batches)     --          338 min        134 min
> As the experiments show, with the proposed changes, the performance
> improved by 60% when inserts and deletes were interleaved in small batches.
> Regards,
> Ning
> Ning Li
> Search Technologies
> IBM Almaden Research Center
> 650 Harry Road
> San Jose, CA 95120

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message