lucene-dev mailing list archives

From John Wang <john.w...@gmail.com>
Subject Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch in Solr ?
Date Tue, 22 Sep 2009 01:35:47 GMT
Jason:

    You are missing the point.

    The idea is to avoid merging large segments. The point of this
MergePolicy is to balance segment merges across the index. The aim is not to
have 1 large segment; it is to have n segments with balanced sizes.

    When the large segment is out of the IO cache, replacing it is very
costly. What we have done is to split the cost over time by having more
frequent but faster merges.

    I am not suggesting Lucene's default MergePolicy isn't good. It is just
not suitable for our case, where a high update rate introduces tons of
deletes. The fact that the API is nice enough to allow MergePolicies to be
plugged in is a good thing.
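
    As a rough illustration of why deletes matter when sizing segments: the
sketch below discounts a segment's byte size by its fraction of deleted docs,
which is my understanding of what the calibrateSizeByDeletes option in the
subject line is meant to do. The class and method names are made up for this
example and are not Lucene APIs; the real policy may differ in detail.

```java
public class CalibratedSize {

    /**
     * Hypothetical helper (not a Lucene API): returns the segment size
     * scaled down by the fraction of its documents that are deleted.
     */
    static long calibrate(long byteSize, int docCount, int delCount) {
        if (docCount <= 0) {
            // No documents to compute a ratio from; leave the size unchanged.
            return byteSize;
        }
        double delRatio = (double) delCount / docCount;
        return (long) (byteSize * (1.0 - delRatio));
    }

    public static void main(String[] args) {
        // A segment with half of its docs deleted counts as half its size.
        System.out.println(calibrate(1000L, 100, 50));
        // A segment with no deletes keeps its full size.
        System.out.println(calibrate(1000L, 100, 0));
    }
}
```

    With this kind of discount, a large segment that is mostly deletes looks
small to the policy and becomes a merge candidate sooner, instead of being
left alone because of its raw on-disk size.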

    Please DO read the wiki.

-John

On Tue, Sep 22, 2009 at 8:58 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> I'm not sure I communicated the idea properly. If CMS is set to
> 1 thread, no matter how intensive the CPU for a merge, it's
> limited to 1 core of what is in many cases a 4 or 8 core server.
> That leaves the other 3 or 7 cores for queries, which if slow,
> indicates that it isn't the merging that's slowing down queries,
> but the dumping of the queried segments from the system IO cache.
>
> This holds true regardless of the merge policy used. So while a
> new merge policy sounds great, unless the system IO cache
> problem is solved, there will always be a lingering problem
> with large merges on a regularly updated index. Avoiding
> large merges probably isn't the answer, and
> LogByteSizeMergePolicy already allows some management of the
> size of the segments merged. I would personally prefer being able to
> merge segments up to a given estimated size, which requires
> LUCENE-1076 to do well.
>
> > is rather different from Lucene benchmark as we are testing
> > high updates in a realtime environment
>
> Lucene's benchmark allows this. NearRealtimeReaderTask is a good
> place to start.
>
> On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.wang@gmail.com> wrote:
> > Jason:
> >
> >    Before jumping to any conclusions, let me describe the test setup. It
> > is rather different from Lucene benchmark as we are testing high updates
> > in a realtime environment:
> >
> >    We took a public corpus, Medline, indexed approximately 3 million
> > docs, and then updated all the docs over and over again for 10 hours.
> >
> >    The only difference in the code used was the MergePolicy settings
> > applied.
> >
> >    Taking the variable of HW/OS out of the equation, let's ignore the
> > absolute numbers and compare the relative numbers between the two runs.
> >
> >    The spike is due to merging of a large segment when we accumulate. The
> > graph/perf numbers fit our hypothesis that the default MergePolicy chooses
> > to merge small segments before large ones and does not handle segments
> > with a high number of deletes well.
> >
> >     Merging is BOTH IO and CPU intensive. Especially large ones.
> >
> >     I think the wiki explains it pretty well.
> >
> >     What you are saying is true about the IO cache w.r.t. merges. Every
> > time new files are created, old files in the IO cache are invalidated. As
> > the experiment shows, this is detrimental to query performance when large
> > segments are being merged.
> >
> >     "As we move to a sharded model of indexes, large merges will
> > naturally not occur." Our test is on a 3 million document index, not very
> > large for a single shard. Some katta people have run it on a much much
> > larger index per shard. Saying large merges will not occur on indexes of
> > this size IMHO is unfounded.
> >
> > -John
> >
> > On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
> > <jason.rutherglen@gmail.com> wrote:
> >>
> >> John,
> >>
> >> It would be great if Lucene's benchmark were used so everyone
> >> could execute the test in their own environment and verify. It's
> >> not clear what settings or code were used to generate the results,
> >> so it's difficult to draw any reliable conclusions.
> >>
> >> The steep spike shows greater evidence for the IO cache being
> >> cleared during large merges resulting in search performance
> >> degradation. See:
> >> http://www.lucidimagination.com/search/?q=madvise
> >>
> >> Merging is IO intensive and less CPU intensive. If the
> >> ConcurrentMergeScheduler is used, which defaults to 3 threads,
> >> then the CPU could be maxed out. Using a single thread on
> >> synchronous spinning magnetic media seems more logical. Queries
> >> are usually the inverse: CPU intensive, not IO intensive, when
> >> the index is in the IO cache. After merging a large segment (or
> >> during), queries would start hitting disk, and the results
> >> clearly show that. The queries are suddenly more time consuming
> >> as they seek on disk at a time when IO activity is at its peak
> >> from merging large segments. Using madvise would prevent usable
> >> indexes from being swapped to disk during a merge, so query
> >> performance would continue unabated.
> >>
> >> As we move to a sharded model of indexes, large merges will
> >> naturally not occur. Shards will reach a specified size and new
> >> documents will be sent to new shards.
> >>
> >> -J
> >>
> >> > On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.wang@gmail.com> wrote:
> >> > The current default Lucene MergePolicy does not handle frequent updates
> >> > well.
> >> >
> >> > We have done some performance analysis with that and a custom merge
> >> > policy:
> >> >
> >> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
> >> >
> >> > -John
> >> >
> >> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
> >> > jason.rutherglen@gmail.com> wrote:
> >> >
> >> >> I opened SOLR-1447 for this
> >> >>
> >> >> 2009/9/18 Noble Paul നോബിള്‍  नोब्ळ् <noble.paul@corp.aol.com>:
> >> >> > We can use a simple reflection based implementation to simplify
> >> >> > reading too many parameters.
> >> >> >
> >> >> > What I wish to emphasize is that Solr should be agnostic of xml
> >> >> > altogether. It should only be aware of specific Objects and
> >> >> > interfaces. If users wish to plug in something else in some other
> >> >> > way, it should be fine.
> >> >> >
> >> >> >
> >> >> > There is a lot of learning involved in the current
> >> >> > solrconfig.xml. Let us not make people throw that away.
> >> >> >
> >> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
> >> >> > <jason.rutherglen@gmail.com> wrote:
> >> >> >> Over the weekend I may write a patch to allow simple reflection
> >> >> >> based injection from within solrconfig.
> >> >> >>
> >> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
> >> >> >> <yonik@lucidimagination.com> wrote:
> >> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
> >> >> >>> <shalinmangar@gmail.com> wrote:
> >> >> >>>>> I was wondering if there is a way I can modify
> >> >> >>>>> calibrateSizeByDeletes just by configuration ?
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>> Alas, no. The only option that I see for you is to sub-class
> >> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
> >> >> >>>> the constructor. However, please open a Jira issue so we don't
> >> >> >>>> forget about it.
> >> >> >>>
> >> >> >>> It's the continuing stuff like this that makes me feel like we
> >> >> >>> should be Spring (or equivalent) based someday... I'm just not
> >> >> >>> sure how we're going to get there.
> >> >> >>>
> >> >> >>> -Yonik
> >> >> >>> http://www.lucidimagination.com
> >> >> >>>
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > -----------------------------------------------------
> >> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
> >> >> >
> >> >>
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-dev-help@lucene.apache.org
> >>
> >
> >
>
