lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: MergePolicy Thresholds
Date Mon, 02 May 2011 14:31:02 GMT
>
> The problem is - each person needs his own set of knobs (or thinks he
> needs them) for MergePolicy, and I can't call any of these sets
> superior to others :/
>

I agree. I wonder, though, whether the knobs we give on LogMP are intuitive enough.

> It neatly avoids uber-merges
>

I didn't see a way to define what an "uber-merge" is, right? Can I tell it to
stop merging segments above some size? E.g., if my index grew to 100 segments,
40GB each, I don't think that merging 10 40GB segments (to create a 400GB
segment) is going to speed up my search. A 40GB segment (probably much less)
is already big enough to not be touched anymore.

Will BalancedMP stop merging such segments (if all segments are of that
order of magnitude)?
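
As a minimal sketch of the kind of threshold being asked about here (this is
not Lucene code, and not how LogMP or BalancedMP actually work): the
hypothetical pickMerge below takes segment sizes in MB, a
maxResultSegmentSizeMB cap and a mergeFactor, and greedily picks the largest
segments whose combined size still fits under the cap. The class name, the
method name and the greedy strategy are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical selection rule, not part of Lucene.
class MaxResultSizeSketch {

    /**
     * Greedily picks up to mergeFactor segments, largest first, whose
     * combined size stays within maxResultSegmentSizeMB. Returns an empty
     * list when fewer than two segments can be combined under the cap.
     */
    static List<Long> pickMerge(long[] segmentSizesMB,
                                long maxResultSegmentSizeMB,
                                int mergeFactor) {
        long[] sorted = segmentSizesMB.clone();
        Arrays.sort(sorted); // ascending; walked from the end for largest-first

        List<Long> picked = new ArrayList<>();
        long total = 0;
        for (int i = sorted.length - 1; i >= 0 && picked.size() < mergeFactor; i--) {
            // Take a segment only if the merged result would still fit.
            if (total + sorted[i] <= maxResultSegmentSizeMB) {
                picked.add(sorted[i]);
                total += sorted[i];
            }
        }
        // A "merge" of a single segment is pointless, so report nothing.
        return picked.size() >= 2 ? picked : new ArrayList<>();
    }

    public static void main(String[] args) {
        final long GB = 1024; // sizes are expressed in MB

        // 5 GB and 7 GB segments with a 20 GB cap and mergeFactor=10.
        System.out.println(pickMerge(new long[] {5 * GB, 7 * GB}, 20 * GB, 10));

        // Ten 40 GB segments with a 40 GB cap.
        long[] big = new long[10];
        Arrays.fill(big, 40 * GB);
        System.out.println(pickMerge(big, 40 * GB, 10));
    }
}
```

Under these assumptions, the 5 GB and 7 GB segments from the original post are
picked for merging under a 20 GB cap, while ten 40 GB segments under a 40 GB
cap are left alone, which is the behaviour the question above is after.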

Shai

On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot <earwin@gmail.com> wrote:

> Dunno, I'm quite happy with numLargeSegments (you critically
> misspelled it). It neatly avoids uber-merges, keeps the number of
> segments at bay, and does not require recalculating thresholds when
> my expected index size changes.
>
> The problem is - each person needs his own set of knobs (or thinks he
> needs them) for MergePolicy, and I can't call any of these sets
> superior to others :/
>
> 2011/5/2 Shai Erera <serera@gmail.com>:
> > I did look at it, but I didn't find that it answers this particular need
> > (ending with a segment no bigger than X). Perhaps by tweaking several
> > parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve
> > something, but it's not very clear what is the right combination.
> >
> > Which is related to one of the points -- is it not more intuitive for an
> > app to set this threshold (if it needs any thresholds), than tweaking all
> > of those parameters? If so, then we only need two thresholds (size +
> > mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
> > (perhaps w/ some adaptations) to derive a merge plan.
> >
> > Shai
> >
> > On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot <earwin@gmail.com> wrote:
> >>
> >> Have you checked BalancedSegmentMergePolicy? It has some more knobs :)
> >>
> >> On Mon, May 2, 2011 at 17:03, Shai Erera <serera@gmail.com> wrote:
> >> > Hi
> >> >
> >> > Today, LogMP allows you to set different thresholds for segment sizes,
> >> > thereby allowing you to control the largest segment that will be
> >> > considered for merge + the largest segment your index will hold (=~
> >> > threshold * mergeFactor).
> >> >
> >> > So, if you want to end up w/ say 20GB segments, you can set
> >> > maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
> >> >
> >> > However, this often does not achieve your desired goal -- if the index
> >> > contains 5 and 7 GB segments, they will never be merged b/c they are
> >> > bigger than the threshold. I am willing to spend the CPU and IO
> >> > resources to end up w/ 20 GB segments, whether I'm merging 10 segments
> >> > together or only 2. After I reach a 20GB segment, it can rest
> >> > peacefully, at least until I increase the threshold.
> >> >
> >> > So I wonder, first, if this threshold (i.e., largest segment size you
> >> > would like to end up with) is more natural to set than the current
> >> > thresholds, from the application level? I.e., wouldn't it be a simpler
> >> > threshold to set instead of doing weird calculus that depends on
> >> > maxMergeMB(ForOptimize) and mergeFactor?
> >> >
> >> > Second, should this be an addition to LogMP, or a different
> >> > type of MP? One that adheres to only those two factors (perhaps the
> >> > segSize threshold should be allowed to be set differently for optimize
> >> > and regular merges). It can pick segments for merge such that it
> >> > maximizes the result segment size (i.e., don't necessarily merge in
> >> > sequential order), but picks no more than mergeFactor segments.
> >> >
> >> > I guess, if we think that maxResultSegmentSizeMB is more intuitive than
> >> > the current thresholds, application-wise, then this change should go
> >> > into LogMP. Otherwise, it feels like a different MP is needed, because
> >> > LogMP is already complicated and another threshold would confuse things.
> >> >
> >> > What do you think of this? Am I trying to optimize too much? :)
> >> >
> >> > Shai
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Kirill Zakharenko/Кирилл Захаренко
> >> E-Mail/Jabber: earwin@gmail.com
> >> Phone: +7 (495) 683-567-4
> >> ICQ: 104465785
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> >
>
>
>
>
>
