lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?
Date Tue, 22 Sep 2009 14:48:13 GMT
John are you using IndexWriter.setMergedSegmentWarmer, so that a newly
merged segment is warmed before it's "put into production" (returned
by getReader)?
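The warming pattern Mike is pointing at can be illustrated with a self-contained sketch (hypothetical types; the real hook is IndexWriter.setMergedSegmentWarmer, not these stubs): warm the newly merged segment's reader before swapping it in, so the first search after the merge doesn't pay the warm-up cost.

```java
// Sketch of the "warm before publish" pattern (hypothetical types, not
// the actual Lucene API): run warm-up work against a freshly merged
// segment's reader before it becomes visible to searchers.
interface SegmentReaderStub {
    int docCount();
}

interface SegmentWarmer {
    void warm(SegmentReaderStub reader);
}

class MergePublisher {
    private final SegmentWarmer warmer;
    private SegmentReaderStub live;  // the reader searchers currently see

    MergePublisher(SegmentWarmer warmer) { this.warmer = warmer; }

    // Called when a merge finishes: warm first, then swap into production.
    void publishMerged(SegmentReaderStub merged) {
        warmer.warm(merged);   // e.g. touch norms, run typical queries
        live = merged;         // only now is it visible via getReader()
    }

    SegmentReaderStub getReader() { return live; }
}
```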

Mike

On Mon, Sep 21, 2009 at 9:35 PM, John Wang <john.wang@gmail.com> wrote:
> Jason:
>
>     You are missing the point.
>
>     The idea is to avoid merging of large segments. The point of this
> MergePolicy is to balance segment merges across the index. The aim is not to
> have 1 large segment, it is to have n segments with balanced sizes.
>
>     When the large segment is out of the IO cache, replacing it is very
> costly. What we have done is to split the cost over time by having more
> frequent but faster merges.
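The balancing idea can be sketched with toy code (hypothetical, not the actual ZoieMergePolicy from the wiki): each step merges only the smallest segments, so the large ones stay untouched in the IO cache and segment sizes stay roughly even.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of a size-balancing merge step (not the real ZoieMergePolicy):
// merge the 'mergeFactor' smallest segments, leaving large segments alone
// so sizes converge instead of snowballing into one giant segment.
class BalancedMergeStep {
    // Returns the new list of segment sizes after one merge of the
    // mergeFactor smallest segments.
    static List<Long> mergeSmallest(List<Long> sizes, int mergeFactor) {
        if (sizes.size() < mergeFactor) return sizes;
        List<Long> sorted = new ArrayList<>(sizes);
        Collections.sort(sorted);
        long merged = 0;
        for (int i = 0; i < mergeFactor; i++) merged += sorted.get(i);
        List<Long> out = new ArrayList<>(sorted.subList(mergeFactor, sorted.size()));
        out.add(merged);
        return out;
    }
}
```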
>
>     I am not suggesting Lucene's default MergePolicy isn't good; it is just
> not suitable for our case, where high updates introduce tons of deletes. The
> fact that the API is nice enough to allow MergePolicies to be plugged in is
> a good thing.
>
>     Please DO read the wiki.
>
> -John
>
> On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen
> <jason.rutherglen@gmail.com> wrote:
>>
>> I'm not sure I communicated the idea properly. If CMS is set to
>> 1 thread, no matter how CPU intensive a merge is, it's
>> limited to 1 core of what is in many cases a 4 or 8 core server.
>> That leaves the other 3 or 7 cores for queries; if queries are
>> still slow, it isn't the merging that's slowing them down, but
>> the dumping of the queried segments from the system IO cache.
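The single-merge-thread setup described above can be modeled with a plain single-threaded executor (a self-contained sketch; in Lucene itself the knob is the ConcurrentMergeScheduler's merge-thread count, not a hand-rolled pool):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: funnel all merge work through one thread so at most one core
// is ever busy merging, leaving the remaining cores free for queries.
class SingleThreadMerger {
    private final ExecutorService mergePool = Executors.newSingleThreadExecutor();
    private final AtomicInteger running = new AtomicInteger();
    volatile int peakConcurrency = 0;

    void submitMerge(Runnable merge) {
        mergePool.submit(() -> {
            peakConcurrency = Math.max(peakConcurrency, running.incrementAndGet());
            try {
                merge.run();
            } finally {
                running.decrementAndGet();
            }
        });
    }

    void shutdownAndWait() {
        mergePool.shutdown();
        try {
            mergePool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```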
>>
>> This holds true regardless of the merge policy used. So while a
>> new merge policy sounds great, unless the system IO cache
>> problem is solved, there will always be a lingering problem in
>> regards to large merges with a regularly updated index. Avoiding
>> large merges probably isn't the answer. And
>> LogByteSizeMergePolicy somewhat allows managing the size of the
>> segments merged already. I would personally prefer being able to
>> merge segments up to a given estimated size, which requires
>> LUCENE-1076 to do well.
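Merging "segments up to a given estimated size" could look roughly like this hypothetical self-contained sketch (not LUCENE-1076 itself): walk the segment list and pack consecutive segments into one merge while the estimated merged size stays under a cap.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of size-capped merge selection (hypothetical, not LUCENE-1076):
// group consecutive segments into a merge as long as the estimated
// merged size stays under maxMergedBytes; oversized segments are skipped.
class SizeCappedMergeSelector {
    static List<List<Long>> selectMerges(List<Long> segmentSizes, long maxMergedBytes) {
        List<List<Long>> merges = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : segmentSizes) {
            if (size > maxMergedBytes) {   // too big to touch: leave it alone
                flush(merges, current);
                current = new ArrayList<>();
                currentBytes = 0;
                continue;
            }
            if (currentBytes + size > maxMergedBytes) {
                flush(merges, current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        flush(merges, current);
        return merges;
    }

    // Only merges of 2+ segments are worth doing.
    private static void flush(List<List<Long>> merges, List<Long> current) {
        if (current.size() >= 2) merges.add(new ArrayList<>(current));
    }
}
```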
>>
>> > is rather different from Lucene benchmark as we are testing
>> high updates in a realtime environment
>>
>> Lucene's benchmark allows this. NearRealtimeReaderTask is a good
>> place to start.
>>
>> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.wang@gmail.com> wrote:
>> > Jason:
>> >
>> >    Before jumping into any conclusions, let me describe the test setup.
>> > It
>> > is rather different from Lucene benchmark as we are testing high updates
>> > in
>> > a realtime environment:
>> >
>> >    We took a public corpus: medline, indexed to approximately 3 million
>> > docs. And update all the docs over and over again for a 10 hour
>> > duration.
>> >
>> >    The only differences in the code used were the different MergePolicy
>> > settings that were applied.
>> >
>> >    Taking the variable of HW/OS out of the equation, let's ignore the
>> > absolute numbers and compare the relative numbers between the two runs.
>> >
>> >    The spike is due to the merging of a large segment once enough
>> > updates accumulate. The graph/perf numbers fit our hypothesis that the
>> > default MergePolicy chooses to merge small segments before large ones
>> > and does not handle segments with a high number of deletes well.
>> >
>> >     Merging is BOTH IO and CPU intensive. Especially large ones.
>> >
>> >     I think the wiki explains it pretty well.
>> >
>> >     What you are saying is true of the IO cache w.r.t. merges. Every
>> > time new files are created, old files in the IO cache are invalidated.
>> > As the experiment shows, this is detrimental to query performance when
>> > large segments are being merged.
>> >
>> >     "As we move to a sharded model of indexes, large merges will
>> > naturally not occur." Our test is on a 3 million document index, not
>> > very
>> > large for a single shard. Some katta people have run it on a much much
>> > larger index per shard. Saying large merges will not occur on indexes of
>> > this size IMHO is unfounded.
>> >
>> > -John
>> >
>> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
>> > <jason.rutherglen@gmail.com> wrote:
>> >>
>> >> John,
>> >>
>> >> It would be great if Lucene's benchmark were used so everyone
>> >> could execute the test in their own environment and verify. It's
>> >> not clear the settings or code used to generate the results so
>> >> it's difficult to draw any reliable conclusions.
>> >>
>> >> The steep spike shows greater evidence for the IO cache being
>> >> cleared during large merges resulting in search performance
>> >> degradation. See:
>> >> http://www.lucidimagination.com/search/?q=madvise
>> >>
>> >> Merging is IO intensive and less CPU intensive, but if the
>> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>> >> then the CPU could be maxed out. Using a single thread on
>> >> synchronous spinning magnetic media seems more logical. Queries
>> >> are usually the inverse: CPU intensive, not IO intensive, when
>> >> the index is in the IO cache. After merging a large segment (or
>> >> during), queries would start hitting disk, and the results
>> >> clearly show that. The queries are suddenly more time consuming
>> >> as they seek on disk at a time when IO activity is at its peak
>> >> from merging large segments. Using madvise would prevent usable
>> >> indexes from being swapped to disk during a merge, so query
>> >> performance would continue unabated.
>> >>
>> >> As we move to a sharded model of indexes, large merges will
>> >> naturally not occur. Shards will reach a specified size and new
>> >> documents will be sent to new shards.
>> >>
>> >> -J
>> >>
>> >> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.wang@gmail.com>
>> >> wrote:
>> >> > The current default Lucene MergePolicy does not handle frequent
>> >> > updates
>> >> > well.
>> >> >
>> >> > We have done some performance analysis with that and a custom merge
>> >> > policy:
>> >> >
>> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>> >> >
>> >> > -John
>> >> >
>> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>> >> > jason.rutherglen@gmail.com> wrote:
>> >> >
>> >> >> I opened SOLR-1447 for this
>> >> >>
>> >> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <noble.paul@corp.aol.com>:
>> >> >> > We can use a simple reflection-based implementation to simplify
>> >> >> > reading so many parameters.
>> >> >> >
>> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
>> >> >> > altogether. It should only be aware of specific Objects and
>> >> >> > interfaces. If users wish to plug in something else in some other
>> >> >> > way, it should be fine.
>> >> >> >
>> >> >> >
>> >> >> > There is a huge learning curve involved in the current
>> >> >> > solrconfig.xml. Let us not make people throw that away.
>> >> >> >
>> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> >> >> > <jason.rutherglen@gmail.com> wrote:
>> >> >> >> Over the weekend I may write a patch to allow simple reflection
>> >> >> >> based
>> >> >> >> injection from within solrconfig.
>> >> >> >>
>> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> >> >> <yonik@lucidimagination.com> wrote:
>> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >> >> >>> <shalinmangar@gmail.com> wrote:
>> >> >> >>>>> I was wondering if there is a way I can modify
>> >> >> >>>>> calibrateSizeByDeletes just by configuration?
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
>> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true
>> >> >> >>>> in the constructor. However, please open a Jira issue so we
>> >> >> >>>> don't forget about it.
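For reference, the effect of calibrateSizeByDeletes can be sketched with self-contained code (hypothetical names, not Lucene's actual implementation): the byte size the policy sees is discounted by the segment's deleted-document ratio, so delete-heavy segments look small, get selected for merging sooner, and have their deletes reclaimed earlier.

```java
// Conceptual sketch of delete-calibrated segment sizing (hypothetical,
// not Lucene's actual code): a segment with many deletes is treated as
// proportionally smaller for merge selection.
class CalibratedSize {
    static long sizeForMergeSelection(long sizeInBytes, int docCount,
                                      int delCount, boolean calibrateSizeByDeletes) {
        if (!calibrateSizeByDeletes || docCount <= 0) return sizeInBytes;
        double delRatio = (double) delCount / docCount;
        return (long) (sizeInBytes * (1.0 - delRatio));
    }
}
```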
>> >> >> >>>
>> >> >> >>> It's the continuing stuff like this that makes me feel like we
>> >> >> >>> should be Spring (or equivalent) based someday... I'm just not
>> >> >> >>> sure how we're going to get there.
>> >> >> >>>
>> >> >> >>> -Yonik
>> >> >> >>> http://www.lucidimagination.com
>> >> >> >>>
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > -----------------------------------------------------
>> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >>
>> >
>> >
>>
>>
>
>


