lucene-dev mailing list archives
From "Yonik Seeley" <yo...@apache.org>
Subject Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)
Date Tue, 05 Sep 2006 20:20:43 GMT
On 9/2/06, Ning Li <ning.li.li@gmail.com> wrote:
> Here is an outline of the rest of the email:
> 1) Current Lucene merge policy.
> 2) Its strengths and weaknesses. The strengths are good candidates for
> index invariants.
> 3) Changed merge behaviour in the patch.
>
> 1) Current Lucene merge policy
> targetMergeDocs is a threshold above which a merge will happen.
> Initially, it's set to maxBufferedDocs. Starting from the newest
> segment, sum the doc counts of consecutive segments whose doc count is
> less than targetMergeDocs. If the sum is less than targetMergeDocs, no
> merge. Otherwise, merge those segments, multiply targetMergeDocs by
> mergeFactor, and go through the process again.
>
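
In pseudo-Java, a minimal sketch of that loop (not the actual IndexWriter
internals; "docCounts" here is a hypothetical list of per-segment doc
counts, oldest first):

    static void maybeMerge(java.util.List<Integer> docCounts,
                           int maxBufferedDocs, int mergeFactor) {
        long targetMergeDocs = maxBufferedDocs;
        while (true) {
            // Walk back from the newest segment, summing consecutive
            // segments whose doc count is below the current threshold.
            int minSegment = docCounts.size();
            long sum = 0;
            while (minSegment > 0
                   && docCounts.get(minSegment - 1) < targetMergeDocs) {
                sum += docCounts.get(--minSegment);
            }
            if (sum < targetMergeDocs)
                break;  // sum below threshold: no merge at this level
            // Merge the collected segments into one, then retry one
            // level up with a threshold mergeFactor times larger.
            long merged = 0;
            while (docCounts.size() > minSegment)
                merged += docCounts.remove(docCounts.size() - 1);
            docCounts.add((int) merged);
            targetMergeDocs *= mergeFactor;
        }
    }
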
> Another part of a merge policy is how ram segments are flushed during
> close(). Currently, the doc counts of all the ram segments plus one
> on-disk segment are summed. If the sum is less than or equal to
> mergeFactor, those segments are merged and written to disk. Otherwise,
> only ram segments are merged and written to disk. (Does it make more
> sense to use maxBufferedDocs as the threshold?)

Yes, I think it does make more sense to use maxBufferedDocs.
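
Something like this, perhaps (a sketch only, not the actual close() code;
ramDocs is the summed doc count of the ram segments, lastSegmentDocs the
doc count of the newest on-disk segment):

    // Fold the newest on-disk segment into the flush only if the
    // combined segment stays within the buffered-docs threshold.
    static boolean mergeWithLastOnDisk(int ramDocs, int lastSegmentDocs,
                                       int maxBufferedDocs) {
        return ramDocs + lastSegmentDocs <= maxBufferedDocs;
    }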

> 2) Strengths and weakness of the current merge policy
> The following notations will be used:
> B = maxBufferedDocs
> M = mergeFactor
> f(n) = floor(log_M (n / B)). If f(n) = c, it means B*(M^c) <= n < B*(M^(c+1)).
>
> The current merge policy has two nice properties:
>   1 If i and i+1 (i older, i+1 newer) are two consecutive segments of
> doc counts x and y with y >= B, then f(x) >= f(y). In other words, if
> B*(M^c) <= y < B*(M^(c+1)), then x >= B*(M^c).
>   2 For any c >= 0, there are fewer than M segments whose doc count y
> satisfies B*(M^c) <= y < B*(M^(c+1)). From property 1 we know segments
> with the same f(y) are consecutive.
> Both properties can be proved by induction.
>
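
For concreteness, with B=1000 and M=10, a segment of 25,000 docs has
f = floor(log_10(25000/1000)) = 1, i.e. 10,000 <= 25,000 < 100,000. A
sketch of the level function in Java, avoiding floating point:

    // f(n) = floor(log_M(n / B)); returns -1 for n < B.
    static int level(int n, int maxBufferedDocs, int mergeFactor) {
        if (n < maxBufferedDocs)
            return -1;  // below the smallest full level
        int f = 0;
        long bound = (long) maxBufferedDocs * mergeFactor;
        while (n >= bound) {
            f++;
            bound *= mergeFactor;
        }
        return f;
    }
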
> It also has two weaknesses:
>   1 It doesn't take deletes into consideration: docCount was used
> above, not numDocs. So it's possible that a large segment with a lot
> of deleted documents won't get cleaned up until much later.
>   2 Property 2 above only covers segments whose doc count is >= B.
> The number of on-disk segments whose doc count is < B could be well
> above M because of how ram segments are flushed. For example, if
> B=1000, M=10, and we close the index every time 10 documents are
> inserted, we could have 99 on-disk segments each with doc count 10.
> Does it make more sense to use B as the threshold during ram flush
> instead of M? Or is it better to simply set a max number of on-disk
> segments whose doc count is < B?

What about an invariant that says the number of main index segments
with the same level (f(n)) should be less than M?

I am concerned about corner cases causing tons of segments and slowing
search or causing errors due to file descriptor exhaustion.

When merging, maybe we should count the number of segments at a
particular index level f(n), rather than adding up the number of
documents. In the presence of deletions, this should lead to faster
indexing (due to less frequent merges), I think.
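
A sketch of that idea (not a patch; level() as sketched above, docCounts
oldest first):

    // Trigger a merge whenever mergeFactor consecutive segments share
    // a level; returns the index of the oldest segment to merge, or -1.
    static int findMergeStart(java.util.List<Integer> docCounts,
                              int maxBufferedDocs, int mergeFactor) {
        int count = 0;
        int prevLevel = Integer.MAX_VALUE;
        for (int i = docCounts.size() - 1; i >= 0; i--) {
            int lev = level(docCounts.get(i), maxBufferedDocs, mergeFactor);
            count = (lev == prevLevel) ? count + 1 : 1;
            prevLevel = lev;
            if (count >= mergeFactor)
                return i;  // merge segments i .. i + mergeFactor - 1
        }
        return -1;  // no level has mergeFactor segments: no merge
    }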

> 3) Changed merge behaviour in the patch
> In the patch, when a merge happens, the segments being merged are
> either all in ram or all on disk, but not both. Because of this, it
> has a weakness similar to the second weakness of the current merge
> policy, but worse. The rest is the same: the same two strengths and
> the same first weakness.
>
> I'm looking forward to more discussion on the index invariants and
> how the current merge policy could be improved, etc. So although the
> patch could be modified to match the exact behaviour of the current
> merge policy, I'll wait until we reach some agreement.

What is the behavior of your patch in the following scenario:

M=10, B=1000
open writer, add 3 docs, close writer
open writer, add 1000 docs, close writer

Do you avoid the situation of having consecutive segments with docs=3
and docs=1000 (where f(n) increases with segment number... a no-no)?
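
In API terms (a 1.9/2.0-style snippet; the index path and field name are
illustrative):

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MergeScenario {
        static void addDocs(String path, boolean create, int n)
                throws Exception {
            IndexWriter writer =
                new IndexWriter(path, new SimpleAnalyzer(), create);
            writer.setMergeFactor(10);        // M
            writer.setMaxBufferedDocs(1000);  // B
            for (int i = 0; i < n; i++) {
                Document doc = new Document();
                doc.add(new Field("id", Integer.toString(i),
                                  Field.Store.YES, Field.Index.UN_TOKENIZED));
                writer.addDocument(doc);
            }
            writer.close();
        }

        public static void main(String[] args) throws Exception {
            addDocs("/tmp/index", true, 3);     // open writer, add 3 docs, close
            addDocs("/tmp/index", false, 1000); // open writer, add 1000 docs, close
        }
    }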


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

