lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
Date Wed, 17 Nov 2010 23:13:13 GMT


Michael McCandless commented on LUCENE-2680:

> I think we may be backtracking here, as I had earlier proposed we simply
> store each term/query in a map per segment; however, I think that was nixed
> in favor of last segment + deletes per segment afterwards. We're not
> worried about the cost of storing pending deletes in a map per segment?

OK sorry now I remember.

Hmm but, my objection then was to carrying all deletes backward to all
segments.

Whereas now I think what we can do is only record the deletions that
were added while that segment was a RAM buffer, in its pending deletes
map?  This should be fine, since we aren't storing a single deletion
in multiple places (well, until DWPTs anyway).  It's just that, when
applying deletes to a segment because it's about to be merged, we have
to do a merge sort of the buffered deletes of all "future" segments.
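The scheme above can be sketched roughly as follows. This is a hypothetical illustration, not Lucene's actual code: the class and field names (`SegmentDeletes`, `termToDocIdUpto`, `deletesToApply`) are invented for the sketch, and it assumes segments are held in index order, so that deletes buffered against segment i apply to segments 0..i, and applying deletes to the segment at index k means unioning the buffered deletes of segments k..end:

```java
import java.util.*;

// Hypothetical sketch: each segment keeps only the deletes that were
// buffered while it was the in-RAM segment.  Applying deletes to a segment
// about to be merged means consulting the buffered deletes of that segment
// and of all "future" (later) segments.
class PendingDeletesSketch {

    // Per-segment pending deletes: delete term -> docID up to which it applies
    static class SegmentDeletes {
        final String name;
        final Map<String, Integer> termToDocIdUpto = new HashMap<>();
        SegmentDeletes(String name) { this.name = name; }
    }

    // Segments are in index order.  To apply deletes to the segment at
    // mergeIndex, union the pending deletes of it and all later segments.
    static Set<String> deletesToApply(List<SegmentDeletes> segments, int mergeIndex) {
        Set<String> union = new HashSet<>();
        for (int i = mergeIndex; i < segments.size(); i++) {
            union.addAll(segments.get(i).termToDocIdUpto.keySet());
        }
        return union;
    }
}
```

In a real implementation the union would be a merge sort over per-segment sorted term iterators rather than a hash-set union, but the set version shows which segments contribute.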

BTW it could also be possible to not necessarily apply deletes when a
segment is merged; eg if there are few enough deletes it may not be
worthwhile.  But we can leave that to another issue.

> Additionally we need to decide how accounting will work for
> maxBufferedDeleteTerms. We won't have a centralized place to keep track of
> the number of terms, and the unique term count in aggregate over many
> segments could be a little too time-consuming to calculate in a method like
> doApplyDeletes. An alternative is to maintain a global unique term count,
> such that when a term is added, every other segment's per-segment deletes is
> checked for that term, and if it's not already been tallied, we increment
> the number of buffered terms.

Maybe we should change the definition to be total number of pending
delete term/queries?  (Ie, not dedup'd across segments).  This seems
reasonable since w/ this new approach the RAM consumed is in
proportion to that total number and not to dedup'd count?
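Under the redefined accounting, the trigger becomes a cheap counter rather than a cross-segment dedup. A minimal sketch of that idea, with invented names (`DeleteAccountingSketch`, `bufferDeleteTerm`) that are not Lucene's API:

```java
// Hypothetical sketch: treat maxBufferedDeleteTerms as the TOTAL number of
// pending delete terms/queries across all segments, NOT dedup'd, so the
// check is a single increment and roughly tracks the RAM actually consumed.
class DeleteAccountingSketch {
    private final int maxBufferedDeleteTerms;
    private int numPendingDeletes;  // total across segments, not dedup'd

    DeleteAccountingSketch(int maxBufferedDeleteTerms) {
        this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
    }

    // Returns true when buffered deletes should be flushed/applied.
    boolean bufferDeleteTerm(String term) {
        numPendingDeletes++;  // the same term buffered against two segments counts twice
        return numPendingDeletes >= maxBufferedDeleteTerms;
    }
}
```

The design point is that the non-dedup'd total is proportional to the RAM the per-segment maps actually hold, whereas the dedup'd count would require consulting every segment's map on each buffered delete.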

> Improve how IndexWriter flushes deletes against existing segments
> -----------------------------------------------------------------
>                 Key: LUCENE-2680
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch,
> LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kick off.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix in
> deletes, though less so in flex since opening the terms index is much
> faster.
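The generations idea in the issue description can be sketched as below. This is a hypothetical illustration with invented names (`DeleteGenerationsSketch`, `onMergeStart`, `deletesFor`), not Lucene's actual classes; it assumes segments not explicitly registered were created at generation 0:

```java
import java.util.*;

// Hypothetical sketch of delete "generations": when a merge kicks off, the
// current set of buffered deletes is pinched off as a closed generation and
// a new one is opened; each segment records the generation it was created
// at, so already-applied deletes are never replayed against a newly merged
// segment.
class DeleteGenerationsSketch {
    private long nextGen = 0;
    // generation -> delete terms buffered during that generation
    private final TreeMap<Long, Set<String>> genToDeletes = new TreeMap<>();
    // segment name -> generation the segment was created at
    private final Map<String, Long> segmentGen = new HashMap<>();

    void bufferDelete(String term) {
        genToDeletes.computeIfAbsent(nextGen, g -> new HashSet<>()).add(term);
    }

    // A merge kicks off: close the current generation and record that the
    // merged segment is created as of the next generation.
    void onMergeStart(String mergedSegmentName) {
        nextGen++;
        segmentGen.put(mergedSegmentName, nextGen);
    }

    // Only deletes from generations >= the segment's creation generation
    // still apply to it; earlier ones were folded in by the merge.
    Set<String> deletesFor(String segmentName) {
        long gen = segmentGen.getOrDefault(segmentName, 0L);
        Set<String> result = new HashSet<>();
        for (Set<String> terms : genToDeletes.tailMap(gen).values()) {
            result.addAll(terms);
        }
        return result;
    }
}
```

For example, a delete buffered before a merge applies to pre-existing segments but not to the segment the merge produces, while a delete buffered after the merge applies to both.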

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
