lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
Date Wed, 17 Nov 2010 18:27:13 GMT


Michael McCandless commented on LUCENE-2680:

Deletes flushed can be removed if we store the docid-upto per segment.
Then we'll go back to having a hash map of deletes.

I think we should do this?

Ie, each flushed segment stores the map of del Term/Query to
docid-upto, where that docid-upto is private to the segment (no
remapping on merges needed).

When it's time to apply deletes to about-to-be-merged segments, we
must apply all "future" segments' deletions unconditionally to each
segment, and then conditionally (respecting the local docid-upto)
apply that segment's deletions.
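
A minimal sketch of that apply step, assuming a toy in-memory model rather than real Lucene classes. `Segment`, `bufferDelete`, `applyDeletes` and `markDeleted` are hypothetical names invented for illustration, only delete-by-Term is modelled, and each doc is just a set of term strings:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SegmentDeletesSketch {

  /** Hypothetical stand-in for a flushed segment; not a Lucene class. */
  static final class Segment {
    final List<Set<String>> docs = new ArrayList<>();   // docID -> terms in that doc
    // Delete term -> docid-upto, private to this segment (no remapping on merge).
    final Map<String, Integer> deleteTermToDocIdUpto = new HashMap<>();
    final BitSet deleted = new BitSet();

    int addDoc(String... terms) {
      docs.add(new HashSet<>(Arrays.asList(terms)));
      return docs.size() - 1;
    }

    /** Buffer a delete: it covers only docs added to this segment so far. */
    void bufferDelete(String term) {
      deleteTermToDocIdUpto.put(term, docs.size());
    }
  }

  /** Apply deletes to segments.get(target); segments are in flush order. */
  static void applyDeletes(List<Segment> segments, int target) {
    Segment seg = segments.get(target);
    // 1) Deletes recorded by later ("future") segments apply unconditionally.
    for (int j = target + 1; j < segments.size(); j++) {
      for (String term : segments.get(j).deleteTermToDocIdUpto.keySet()) {
        markDeleted(seg, term, seg.docs.size());
      }
    }
    // 2) The segment's own deletes apply only below their docid-upto.
    for (Map.Entry<String, Integer> e : seg.deleteTermToDocIdUpto.entrySet()) {
      markDeleted(seg, e.getKey(), e.getValue());
    }
  }

  /** Mark every doc with docID < docIdUpto that contains the term. */
  static void markDeleted(Segment seg, String term, int docIdUpto) {
    for (int docId = 0; docId < docIdUpto; docId++) {
      if (seg.docs.get(docId).contains(term)) {
        seg.deleted.set(docId);
      }
    }
  }

  public static void main(String[] args) {
    Segment s0 = new Segment();
    s0.addDoc("id:1");        // old version
    s0.bufferDelete("id:1");  // docid-upto = 1
    s0.addDoc("id:1");        // new version, added after the delete

    Segment s1 = new Segment();
    s1.bufferDelete("id:1");  // docid-upto = 0 within s1
    s1.addDoc("id:1");        // newest version

    List<Segment> segments = Arrays.asList(s0, s1);
    applyDeletes(segments, 0);
    applyDeletes(segments, 1);
    System.out.println("s0 deleted: " + s0.deleted);
    System.out.println("s1 deleted: " + s1.deleted);
  }
}
```

In this toy run the delete recorded by the later segment removes both versions in the earlier one, while the newest doc survives because it sits at or above its own segment's docid-upto.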

Intuitively, yes, however here's the constructor of HashSet:

public HashSet() { map = new HashMap<E,Object>(); }

Ugh I forgot about that.  Is that still true?  That's awful.

bq. why are we tracking the last segment info/index?

I thought last segment was supposed to be used to mark the last segment of
a commit/flush. This way we save on the hash (set/map) space on the
segments up to the last segment when the commit occurred.

Hmm... I think lastSegment was needed only for the multiple DWPT
case, to record the last segment already flushed in the index as of
when that DWPT was created.  This is so we know, "going back", when we
can start unconditionally applying the buffered delete term.

With the single DWPT we effectively have today, isn't last segment
always going to be what we just flushed?  (Or null if we haven't yet
done a flush in the current session.)

bq. Do we really need to track appliedTerms/appliedQueries? Ie is this just an optimization
so that if the caller deletes by the Term/Query again we know to skip it?

Yes to the 2nd question. Why would we want to try deleting multiple times?
The cost is the terms dictionary lookup which you're saying is in the
noise? I think potentially cracking open a query again could be costly in
cases where the query is indeed expensive.

I'm saying this is unlikely to be a worthwhile way to spend RAM.

E.g., most apps wouldn't delete by the same term again; they'd
"typically" go and process a big batch of docs, deleting by an id
field and adding the new version of the doc, where a given id is seen
only once in this session, and then IW is committed/closed?

> Improve how IndexWriter flushes deletes against existing segments
> -----------------------------------------------------------------
>                 Key: LUCENE-2680
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: 4.0
>         Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch,
>                      LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch
> IndexWriter buffers up all deletes (by Term and Query) and only
> applies them if 1) commit or NRT getReader() is called, or 2) a merge
> is about to kick off.
> We do this because, for a large index, it's very costly to open a
> SegmentReader for every segment in the index.  So we defer as long as
> we can.  We do it just before merge so that the merge can eliminate
> the deleted docs.
> But, most merges are small, yet in a big index we apply deletes to all
> of the segments, which is really very wasteful.
> Instead, we should only apply the buffered deletes to the segments
> that are about to be merged, and keep the buffer around for the
> remaining segments.
> I think it's not so hard to do; we'd have to have generations of
> pending deletions, because the newly merged segment doesn't need the
> same buffered deletions applied again.  So every time a merge kicks
> off, we pinch off the current set of buffered deletions, open a new
> set (the next generation), and record which segment was created as of
> which generation.
> This should be a very sizable gain for large indices that mix
> deletes, though less so in flex, since opening the terms index is much
> faster.
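
The generation bookkeeping described in the issue text above might be sketched roughly as follows. This is only an illustration under stated assumptions: `DeleteGenerationsSketch` and all of its methods are hypothetical names, not IndexWriter API, deletes are modelled as plain term strings, and a segment is assumed to need every generation from the one it was recorded in onward:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DeleteGenerationsSketch {
  // One set of buffered delete terms per generation; the last one is "current".
  private final List<Set<String>> generations = new ArrayList<>();
  // Segment name -> generation that was current when the segment was created.
  private final Map<String, Integer> segmentToGeneration = new HashMap<>();

  DeleteGenerationsSketch() {
    generations.add(new LinkedHashSet<>());
  }

  int currentGeneration() {
    return generations.size() - 1;
  }

  void bufferDelete(String term) {
    generations.get(currentGeneration()).add(term);
  }

  /** Called when a merge kicks off: close the current set, open the next. */
  int pinchOff() {
    generations.add(new LinkedHashSet<>());
    return currentGeneration();
  }

  /** Record which generation a (flushed or merged) segment was created in. */
  void recordSegment(String segmentName) {
    segmentToGeneration.put(segmentName, currentGeneration());
  }

  /** Deletes a segment still needs: its creation generation and all later ones. */
  Set<String> pendingDeletesFor(String segmentName) {
    Integer from = segmentToGeneration.get(segmentName);
    if (from == null) {
      throw new IllegalArgumentException("unknown segment: " + segmentName);
    }
    Set<String> pending = new LinkedHashSet<>();
    for (int g = from; g < generations.size(); g++) {
      pending.addAll(generations.get(g));
    }
    return pending;
  }

  public static void main(String[] args) {
    DeleteGenerationsSketch d = new DeleteGenerationsSketch();
    d.recordSegment("_0");
    d.bufferDelete("id:1");
    d.pinchOff();            // a merge kicks off; next generation opens
    d.recordSegment("_1");   // e.g. the newly merged segment
    d.bufferDelete("id:2");
    System.out.println("_0 needs " + d.pendingDeletesFor("_0"));
    System.out.println("_1 needs " + d.pendingDeletesFor("_1"));
  }
}
```

The point of the sketch is only that a segment recorded after a pinch-off never sees the earlier generation's deletes again, which is the property the issue text asks for.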
