lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
Date Wed, 17 Nov 2010 11:47:17 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932915#action_12932915 ]

Michael McCandless commented on LUCENE-2680:
--------------------------------------------

Why do we still have deletesFlushed?  And why do we still need to
remap docIDs on merge?  I thought that with this new approach the
docIDUpto for each buffered delete Term/Query would be a docID local
to that segment?

On flush, the deletesInRAM should be carried directly over to the
segmentDeletes, so there shouldn't be a deletesFlushed at all?
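
Roughly, the flush path I'm picturing looks like this (a toy sketch
only; none of the class or field names below come from the actual
patch):

import java.util.HashMap;
import java.util.Map;

// Toy sketch of the flush path described above; names are illustrative.
class FlushSketch {

  // buffered deletes: term text -> docIDUpto, where docIDUpto is
  // already local to the segment being flushed
  final Map<String,Integer> deletesInRAM = new HashMap<String,Integer>();

  // per-segment buffered deletes, keyed by segment name
  final Map<String,Map<String,Integer>> segmentDeletes =
      new HashMap<String,Map<String,Integer>>();

  void flushDeletes(String newSegmentName) {
    // Carry the RAM buffer straight over to the new segment's deletes;
    // because docIDUpto is segment-local there is no remapping and no
    // separate deletesFlushed.
    segmentDeletes.put(newSegmentName,
                       new HashMap<String,Integer>(deletesInRAM));
    deletesInRAM.clear();
  }
}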

A few other small things:

  * You can use SegmentInfos.clone to copy the segment infos? (it
    makes a deep copy)

  * SegmentDeletes.clearAll() need not iterate through the
    terms/queries to subtract the RAM used?  Ie just multiply the
    per-entry cost by .size() and make one call to deduct the RAM
    used?  (See the sketch after this list.)

  * The SegmentDeletes should use less than BYTES_PER_DEL_TERM,
    because it's a simple HashSet, not a HashMap?  Ie we are
    over-counting RAM used now?  (Same for the by-Query case.)

  * Can we store a segment's deletes elsewhere?  The SegmentInfo
    should be a lightweight class... eg it's used by DirectoryReader to
    read the index, and for a read-only DirectoryReader there's no need
    to allocate the SegmentDeletes?  These data structures should only
    be held by IndexWriter/DocumentsWriter.

  * Do we really need to track appliedTerms/appliedQueries?  Ie is
    this just an optimization so that if the caller deletes by the same
    Term/Query again we know to skip it?  Seems unnecessary if that's
    all it buys us...
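
For the clearAll() point above, something along these lines is what I
mean (a sketch only; the per-entry constants and field names are
placeholders, not the patch's actual values):

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of size()-based RAM accounting in clearAll(); the constants
// and field names are placeholders.
class SegmentDeletesSketch {
  static final long BYTES_PER_DEL_TERM = 32;   // made-up per-entry cost
  static final long BYTES_PER_DEL_QUERY = 24;  // made-up per-entry cost

  final Set<String> terms = new HashSet<String>();
  final Set<String> queries = new HashSet<String>();
  final AtomicLong bytesUsed = new AtomicLong();

  void clearAll() {
    // One deduction covering all entries, instead of iterating each
    // term/query and subtracting its cost individually:
    bytesUsed.addAndGet(-(terms.size() * BYTES_PER_DEL_TERM
                          + queries.size() * BYTES_PER_DEL_QUERY));
    terms.clear();
    queries.clear();
  }
}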


> Improve how IndexWriter flushes deletes against existing segments
> -----------------------------------------------------------------
>
>                 Key: LUCENE-2680
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2680
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch,
LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
>
>
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kick off.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix in
> deletes, though less so in flex, since opening the terms index is much
> faster.
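
To make the generation bookkeeping from the description concrete,
here's a toy model (all names are illustrative only, not from the
patch):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the "generations" idea above; names are illustrative.
class DeleteGenerationsSketch {

  long nextGen;

  // buffered delete terms, each paired with the generation it was
  // buffered under
  final List<Object[]> bufferedDeletes = new ArrayList<Object[]>();

  // segment name -> the first generation of deletes that still needs
  // to be applied to that segment
  final Map<String,Long> segmentGen = new HashMap<String,Long>();

  void bufferDelete(String term) {
    bufferedDeletes.add(new Object[] {term, nextGen});
  }

  // Called when a merge kicks off: the current generation is applied
  // (and dropped) as part of the merge, so the newly merged segment
  // only needs deletes from later generations.
  void onMergeStart(String mergedSegmentName) {
    nextGen++;
    segmentGen.put(mergedSegmentName, nextGen);
  }

  // A buffered delete must be applied to a segment only if the segment
  // predates that delete (or predates all buffered deletes).
  boolean mustApply(long deleteGen, String segmentName) {
    Long firstGen = segmentGen.get(segmentName);
    return firstGen == null || deleteGen >= firstGen;
  }
}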

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


