lucene-dev mailing list archives

From "Jason Rutherglen (JIRA)" <>
Subject [jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
Date Wed, 17 Nov 2010 19:03:15 GMT


Jason Rutherglen commented on LUCENE-2680:

DWPT deletes have perhaps confused this issue a little bit.

{quote}Tracking per-segment would be easier but I worry about indices that
have large numbers of segments... eg w/ a large mergeFactor and frequent
flushing you can get very many segments.{quote}

I think we may be backtracking here, as I had earlier proposed that we
simply store each term/query in a map per segment; however, I think that
was nixed in favor of last segment + deletes per segment afterwards. We're
not worried about the cost of storing pending deletes in a map per segment.
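A per-segment map of pending deletes could be sketched roughly as below. This is a minimal illustration only; the class and method names (`PendingDeletes`, `bufferDelete`, `drain`) are assumptions for the sketch, not Lucene's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: buffer pending delete terms per segment, instead of
// tracking them only against the last flushed segment.
class PendingDeletes {
    // segment name -> delete terms buffered against that segment
    private final Map<String, List<String>> deletesBySegment = new HashMap<>();

    void bufferDelete(String segmentName, String term) {
        deletesBySegment
            .computeIfAbsent(segmentName, k -> new ArrayList<>())
            .add(term);
    }

    // When a segment is merged or its deletes are applied, drain its buffer.
    List<String> drain(String segmentName) {
        List<String> terms = deletesBySegment.remove(segmentName);
        return terms == null ? List.of() : terms;
    }
}
```

The map only grows with segments that actually have pending deletes, which is why the per-segment cost is not a concern here.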

{quote}With the single DWPT we effectively have today isn't last segment
always going to be what we just flushed? (Or null if we haven't yet done a
flush in the current session).{quote}

Pretty much. 

{quote}EG most apps wouldn't delete by same term again, like they'd
"typically" go and process a big batch of docs, deleting by an id field
and adding the new version of the doc, where a given id is seen only once
in this session, and then IW is committed/closed?{quote}

In an extreme RT app that uses Lucene like a database, a doc could in fact
be updated many times, so we'd accumulate and delete the same ID over and
over again. However, in the straight batch indexing model outlined, that
is unlikely to happen.

{quote}When it's time to apply deletes to about-to-be-merged segments, we
must apply all "future" segments deletions unconditionally to each
segment, and then conditionally (respecting the local docid-upto) apply
that segment's deletions.{quote}
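The rule quoted above could be captured in a small predicate; a hedged sketch, where `docIdUpto` stands for the local docid-upto recorded with the segment's own buffered deletes (the helper and its names are illustrative, not Lucene's API):

```java
// Hypothetical sketch of the merge-time delete rule quoted above:
// deletes buffered by "future" segments apply unconditionally to an
// about-to-be-merged segment, while the segment's own buffered deletes
// apply only to docs below the recorded local docid-upto.
class DeleteApplier {
    static boolean shouldDelete(int docId, int docIdUpto, boolean fromFutureSegment) {
        // Future-segment deletes: unconditional.
        // Same-segment deletes: respect the docid-upto boundary.
        return fromFutureSegment || docId < docIdUpto;
    }
}
```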

I'll use this as the go-ahead design then.

bq. Is that still true?

That's from Java 1.6.

> Improve how IndexWriter flushes deletes against existing segments
> -----------------------------------------------------------------
>                 Key: LUCENE-2680
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch,
LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kick off.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix in
> deletes, though less so in flex since opening the terms index is much
> faster.
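The generation scheme described in the issue summary might be modeled as follows. All class and method names here are illustrative assumptions for the sketch, not Lucene's actual code: each merge pinches off the current generation of buffered deletes and opens a new one, and the newly created segment records its generation so older deletes are never re-applied to it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "generations" of buffered deletes.
class DeleteGenerations {
    private final List<List<String>> generations = new ArrayList<>();
    private final Map<String, Integer> segmentGen = new HashMap<>();

    DeleteGenerations() {
        generations.add(new ArrayList<>()); // generation 0, currently open
    }

    void bufferDelete(String term) {
        generations.get(generations.size() - 1).add(term);
    }

    // Called when a merge kicks off: pinch off the current set of buffered
    // deletions, open the next generation, and record which generation the
    // newly merged segment was created as of.
    void onMergeStart(String newSegmentName) {
        generations.add(new ArrayList<>());
        segmentGen.put(newSegmentName, generations.size() - 1);
    }

    // Deletes still pending against a segment: only those buffered in its
    // own generation or later; earlier generations were already applied.
    List<String> pendingFor(String segmentName) {
        int gen = segmentGen.getOrDefault(segmentName, 0);
        List<String> result = new ArrayList<>();
        for (int g = gen; g < generations.size(); g++) {
            result.addAll(generations.get(g));
        }
        return result;
    }
}
```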

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
