lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
Date Sat, 27 Nov 2010 20:35:37 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964450#action_12964450 ]

Michael McCandless commented on LUCENE-2680:
--------------------------------------------

So nice to see remapDeletes deleted!

  * Don't forget to remove DocumentsWriter.get/set/updateFlushedDocCount too.

  * Can you move the deletes out of SegmentInfo?  We can just use a
    Map<SegmentInfo,BufferedDeletes>?  But remember to delete segments
    from the map once we commit the merge...

  * I think DocumentsWriter (DW) shouldn't hold onto the SegmentInfos;
    we should pass it in to only those methods that need it.  That
    SegmentInfos is protected under IndexWriter's (IW's) monitor, so
    it makes me nervous if it's also a member on DW.

  * Hmm, we're no longer accounting for RAM usage of per-segment
    deletes?  I think we need an AtomicInteger, which we incr w/ RAM
    used on pushing deletes into a segment, and decr on clearing?

  * The change to the message(...) in DW.applyDeletes is wrong (ie
    switching to deletesInRAM); I think we should just remove the
    details, ie so it says "applying deletes on N segments"?  But then
    add a more detailed message per-segment with the aggregated
    (forward) deletes details?

  * I think we should move this delete handling out of DW as much as
    possible... that's really IW's role (DW is "about" flushing the
    next segment, not tracking details associated with all other
    segments in the index)

  * Instead of adding pushDeletesLastSegment, can we just have IW call
    pushDeletes(lastSegmentInfo)?

  * Calling .getForwardDeletes inside the for loop iterating over the
    infos is actually O(N^2) cost, and it could matter for
    delete-intensive many-segment indices.  Can you change this,
    instead, to walk the infos backwards, incrementally building up
    the forward deletes to apply to each segment by adding in that
    info's deletions?
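
The map, RAM-accounting, and backward-walk suggestions above could be sketched roughly as follows. This is only an illustration, not the patch's code: the class name, the use of segment names and plain term strings in place of SegmentInfo and BufferedDeletes, the fixed BYTES_PER_TERM estimate, and the AtomicLong counter are all assumptions. It also assumes that deletes recorded against a segment apply to that segment and to every older one.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

// Hypothetical stand-in; not Lucene's IndexWriter/DocumentsWriter code.
final class PerSegmentDeletesSketch {
    // Made-up per-term RAM estimate, used only to show the accounting.
    static final long BYTES_PER_TERM = 64;

    // Deletes kept outside the segment objects, keyed by segment name.
    private final Map<String, Set<String>> deletesBySegment = new HashMap<>();
    private final AtomicLong bytesUsed = new AtomicLong();

    // Push deletes into a segment and increment the RAM counter.
    void pushDeletes(String segment, Set<String> terms) {
        Set<String> existing =
            deletesBySegment.computeIfAbsent(segment, k -> new HashSet<>());
        for (String term : terms) {
            if (existing.add(term)) {
                bytesUsed.addAndGet(BYTES_PER_TERM);
            }
        }
    }

    // Drop a segment's entry (e.g. once a merge commits) and decrement.
    void clearDeletes(String segment) {
        Set<String> removed = deletesBySegment.remove(segment);
        if (removed != null) {
            bytesUsed.addAndGet(-BYTES_PER_TERM * removed.size());
        }
    }

    long bytesUsed() {
        return bytesUsed.get();
    }

    // Walk newest to oldest, adding each segment's own deletes to one
    // running set, so the forward deletes are never rebuilt per segment.
    // The callback receives a read-only view of the running set.
    void applyBackwards(List<String> segmentsOldestFirst,
                        BiConsumer<String, Set<String>> apply) {
        Set<String> forward = new HashSet<>();
        for (int i = segmentsOldestFirst.size() - 1; i >= 0; i--) {
            String segment = segmentsOldestFirst.get(i);
            Set<String> own = deletesBySegment.get(segment);
            if (own != null) {
                forward.addAll(own);
            }
            apply.accept(segment, Collections.unmodifiableSet(forward));
        }
    }
}
```

Each segment's deletes are added to the running set exactly once, so the walk costs one pass over the segments plus the total number of buffered deletes, rather than recomputing the union for every segment.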


> Improve how IndexWriter flushes deletes against existing segments
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2680
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2680
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch,
>                      LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch,
>                      LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
>
>
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kick off.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix in
> deletes, though less so in flex, since opening the terms index is
> much faster there.
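
The generation bookkeeping described in the issue text could look roughly like the sketch below. It is an illustration under stated assumptions, not IndexWriter's implementation: the class and method names are hypothetical, deletes are plain term strings, a new generation is also opened when a segment is first registered (so earlier deletes never reach it), and deletes against documents still buffered in RAM are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of "generations of pending deletions".
final class DeleteGenerationsSketch {
    // generations.get(g) holds the deletes buffered during generation g.
    private final List<Set<String>> generations = new ArrayList<>();
    // Records which generation each live segment was created as of.
    private final Map<String, Integer> createdAtGen = new HashMap<>();

    DeleteGenerationsSketch() {
        generations.add(new HashSet<>());
    }

    // Buffer a delete into the current (newest) generation.
    void bufferDelete(String term) {
        generations.get(generations.size() - 1).add(term);
    }

    // Pinch off the current generation, open the next one, and record
    // the new segment as created as of that next generation.
    void pinchOff(String newSegment) {
        generations.add(new HashSet<>());
        createdAtGen.put(newSegment, generations.size() - 1);
    }

    // Deletes a segment still needs: its own generation and all later ones.
    Set<String> pendingFor(String segment) {
        Integer gen = createdAtGen.get(segment);
        if (gen == null) {
            throw new IllegalArgumentException("unknown segment: " + segment);
        }
        Set<String> pending = new HashSet<>();
        for (int g = gen; g < generations.size(); g++) {
            pending.addAll(generations.get(g));
        }
        return pending;
    }

    // Start a merge: return the deletes to apply to each merging segment
    // only, retire those segments, and register the merged segment at a
    // fresh generation so the same deletes are not applied to it again.
    Map<String, Set<String>> startMerge(List<String> merging, String merged) {
        Map<String, Set<String>> toApply = new HashMap<>();
        for (String segment : merging) {
            toApply.put(segment, pendingFor(segment));
        }
        for (String segment : merging) {
            createdAtGen.remove(segment);
        }
        pinchOff(merged);
        return toApply;
    }
}
```

Segments that are not part of the merge keep their recorded generation, so their pending deletes stay buffered until they are merged or a reader is requested. A fuller version would also discard generations that no live segment still needs.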

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

