lucene-dev mailing list archives

From Earwin Burrfoot <ear...@gmail.com>
Subject Re: MergePolicy Thresholds
Date Mon, 02 May 2011 21:44:49 GMT
>> The problem is - each person needs his own set of knobs (or thinks he
>> needs them) for MergePolicy, and I can't call any of these sets
>> superior to others :/
>
> I agree. I wonder though if the knobs we give on LogMP are intuitive enough.
>
>> It neatly avoids uber-merges
>
> I didn't see that I can define what "uber-merge" is, right? Can I tell it to
> stop merging segments of some size? E.g., if my index grew to 100 segments,
> 40GB each, I don't think that merging 10 40GB segments (to create 400GB
> segment) is going to speed up my search, for instance. A 40GB segment
> (probably much less) is already big enough to not be touched anymore.
No, you can't. But you can tell it to have exactly (not 'at most') N
top-tier segments and try to keep their sizes close with merges.
Whatever that size may be.
And this is exactly what I want. And defining a max cap on segment size
is not what I want.

So the same set of knobs can be intuitive and meaningful for one
person, and useless for another. And you can't pick the "best" one.

> Will BalancedMP stop merging such segments (if all segments are of that
> order of magnitude)?
>
> Shai
>
> On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot <earwin@gmail.com> wrote:
>>
>> Dunno, I'm quite happy with numLargeSegments (you critically
>> misspelled it). It neatly avoids uber-merges, keeps the number of
>> segments at bay, and does not require recalculating thresholds when
>> my expected index size changes.
>>
>> The problem is - each person needs his own set of knobs (or thinks he
>> needs them) for MergePolicy, and I can't call any of these sets
>> superior to others :/
>>
>> 2011/5/2 Shai Erera <serera@gmail.com>:
>> > I did look at it, but I didn't find that it answers this particular need
>> > (ending with a segment no bigger than X). Perhaps by tweaking several
>> > parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve
>> > something, but it's not very clear what the right combination is.
>> >
>> > Which is related to one of the points -- is it not more intuitive for
>> > an app to set this threshold (if it needs any thresholds), than tweaking
>> > all of those parameters? If so, then we only need two thresholds (size +
>> > mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
>> > (perhaps w/ some adaptations) to derive a merge plan.
>> >
>> > Shai
>> >
>> > On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot <earwin@gmail.com> wrote:
>> >>
>> >> Have you checked BalancedSegmentMergePolicy? It has some more knobs :)
>> >>
>> >> On Mon, May 2, 2011 at 17:03, Shai Erera <serera@gmail.com> wrote:
>> >> > Hi
>> >> >
>> >> > Today, LogMP allows you to set different thresholds for segment
>> >> > sizes, thereby allowing you to control the largest segment that will
>> >> > be considered for merge + the largest segment your index will hold
>> >> > (=~ threshold * mergeFactor).
>> >> >
>> >> > So, if you want to end up w/ say 20GB segments, you can set
>> >> > maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
>> >> >
>> >> > However, this often does not achieve your desired goal -- if the
>> >> > index contains 5 and 7 GB segments, they will never be merged b/c
>> >> > they are bigger than the threshold. I am willing to spend the CPU
>> >> > and IO resources to end up w/ 20 GB segments, whether I'm merging
>> >> > 10 segments together or only 2. After I reach a 20GB segment, it
>> >> > can rest peacefully, at least until I increase the threshold.
>> >> >
>> >> > So I wonder, first, if this threshold (i.e., largest segment size
>> >> > you would like to end up with) is more natural to set than the
>> >> > current thresholds, from the application level? I.e., wouldn't it
>> >> > be a simpler threshold to set instead of doing weird calculus that
>> >> > depends on maxMergeMB(ForOptimize) and mergeFactor?
>> >> >
>> >> > Second, should this be an addition to LogMP, or a different type of
>> >> > MP? One that adheres to only those two factors (perhaps the segSize
>> >> > threshold should be allowed to be set differently for optimize and
>> >> > regular merges). It can pick segments for merge such that it
>> >> > maximizes the result segment size (i.e., don't necessarily merge in
>> >> > sequential order), while merging no more than mergeFactor segments
>> >> > at once.
>> >> >
>> >> > I guess, if we think that maxResultSegmentSizeMB is more intuitive
>> >> > than the current thresholds, application-wise, then this change
>> >> > should go into LogMP. Otherwise, it feels like a different MP is
>> >> > needed, because LogMP is already complicated and another threshold
>> >> > would confuse things.
>> >> >
>> >> > What do you think of this? Am I trying to optimize too much? :)
>> >> >
>> >> > Shai
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Kirill Zakharenko/Кирилл Захаренко
>> >> E-Mail/Jabber: earwin@gmail.com
>> >> Phone: +7 (495) 683-567-4
>> >> ICQ: 104465785
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>
>
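
For what it's worth, the capped selection Shai describes in the quoted
message (pick segments so the result is as large as possible, never go over
a size cap, never take more than mergeFactor of them) could look roughly
like the sketch below. It is only a greedy approximation and is not LogMP or
BalancedSegmentMergePolicy code: CappedMergeSketch, pickMerge and the
parameter names are made up for illustration, with maxResultSegmentSizeMB
borrowed from Shai's mail.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these names are Lucene APIs.
class CappedMergeSketch {

    /**
     * Greedily picks up to mergeFactor segments whose combined size stays
     * within maxResultSegmentSizeMB. Returns an empty list when fewer than
     * two segments qualify, since a merge needs at least two inputs.
     */
    static List<Long> pickMerge(List<Long> segmentSizesMB,
                                long maxResultSegmentSizeMB,
                                int mergeFactor) {
        // Largest first, so the result gets as close to the cap as a
        // single greedy pass allows (segments need not be adjacent).
        List<Long> sorted = new ArrayList<>(segmentSizesMB);
        sorted.sort(Comparator.reverseOrder());

        List<Long> picked = new ArrayList<>();
        long total = 0;
        for (long size : sorted) {
            if (picked.size() == mergeFactor) {
                break;
            }
            // Skip any segment that would push the merged result over the cap.
            if (total + size <= maxResultSegmentSizeMB) {
                picked.add(size);
                total += size;
            }
        }
        if (picked.size() < 2) {
            picked.clear();
        }
        return picked;
    }

    public static void main(String[] args) {
        long capMB = 20 * 1024; // 20GB cap on the merged segment

        // 5GB and 7GB segments: too big for a 2GB maxMergeMB threshold,
        // but still mergeable under a 20GB result cap.
        System.out.println(pickMerge(List.of(5L * 1024, 7L * 1024), capMB, 10));

        // Ten 40GB segments: each one already exceeds the cap.
        List<Long> big = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            big.add(40L * 1024);
        }
        System.out.println(pickMerge(big, capMB, 10));
    }
}
```

With a 20GB cap the 5GB and 7GB segments are picked together, while the 40GB
segments are left alone. A greedy pass will not always find the fullest
combination; doing that properly is a knapsack-style search.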



-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: earwin@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

