Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <8837fb770909211650x19f6dbc1nc1e0f621827d4906@mail.gmail.com>
References: <62AB8D44-ABF1-415C-B70E-8CE97B967EE2@mac.com>
	 <69de18140909171330x1b6ea3d1t5738037666601004@mail.gmail.com>
	 <c68e39170909180810q6f550970v4cf20af2d467679c@mail.gmail.com>
	 <85d3c3b60909181329v51de1ef9tdddf2aba0774afc4@mail.gmail.com>
	 <5e76b0ad0909182201m4cd78f56i17f33e548b3dae44@mail.gmail.com>
	 <85d3c3b60909202208td195083p17d84e98ebb24e06@mail.gmail.com>
	 <8837fb770909202312u525bc73dh22dd1d251098e5b3@mail.gmail.com>
	 <85d3c3b60909211134j7e51addfpcad82e86353387e4@mail.gmail.com>
	 <8837fb770909211650x19f6dbc1nc1e0f621827d4906@mail.gmail.com>
Date: Mon, 21 Sep 2009 17:58:02 -0700
Message-ID: <85d3c3b60909211758l7871d573ka51340e6c449367f@mail.gmail.com>
Subject: Re: How to leverage the LogMergePolicy "calibrateSizeByDeletes" patch
	in Solr ?
From: Jason Rutherglen <jason.rutherglen@gmail.com>
To: java-dev@lucene.apache.org

I'm not sure I communicated the idea properly. If CMS is set to
1 thread, then no matter how CPU intensive a merge is, it's
limited to 1 core of what is in many cases a 4 or 8 core server.
That leaves the other 3 or 7 cores for queries. If queries are
still slow, that indicates it isn't the CPU cost of merging
that's slowing them down, but the dumping of the queried
segments from the system IO cache.

This holds true regardless of the merge policy used. So while a
new merge policy sounds great, unless the system IO cache
problem is solved, there will always be a lingering problem
with large merges on a regularly updated index. Avoiding
large merges probably isn't the answer, and
LogByteSizeMergePolicy already allows some control over the size
of the segments merged. I would personally prefer being able to
merge segments up to a given estimated size, which requires
LUCENE-1076 to do well.
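
To make "merge segments up to a given estimated size" a bit more
concrete, here is a rough sketch in plain Java with no Lucene
dependency. The names (SegmentStat, estimatedLiveBytes,
pickForMerge) are made up for illustration and are not Lucene
APIs. It assumes a segment's live size can be approximated by
scaling its on-disk bytes by the fraction of non-deleted docs,
which is roughly the idea behind calibrateSizeByDeletes, and then
greedily collects the smallest segments until a byte budget is
reached. A real merge policy has more to consider than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MergeSizeSketch {

    /** Hypothetical per-segment stats; not a Lucene class. */
    static final class SegmentStat {
        final String name;
        final long bytesOnDisk;
        final int maxDoc;
        final int delCount;

        SegmentStat(String name, long bytesOnDisk, int maxDoc, int delCount) {
            this.name = name;
            this.bytesOnDisk = bytesOnDisk;
            this.maxDoc = maxDoc;
            this.delCount = delCount;
        }
    }

    /** Assumption: live size ~= on-disk bytes scaled by the fraction of non-deleted docs. */
    static long estimatedLiveBytes(SegmentStat s) {
        if (s.maxDoc <= 0) {
            return 0L;
        }
        double liveRatio = 1.0 - (double) s.delCount / s.maxDoc;
        return (long) (s.bytesOnDisk * liveRatio);
    }

    /** Greedily picks the smallest segments (by estimated live size) that fit in the budget. */
    static List<SegmentStat> pickForMerge(List<SegmentStat> segments, long maxMergedBytes) {
        List<SegmentStat> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingLong(MergeSizeSketch::estimatedLiveBytes));
        List<SegmentStat> picked = new ArrayList<>();
        long total = 0L;
        for (SegmentStat s : sorted) {
            long size = estimatedLiveBytes(s);
            if (total + size > maxMergedBytes) {
                break; // every remaining segment is at least this large
            }
            picked.add(s);
            total += size;
        }
        return picked;
    }

    public static void main(String[] args) {
        List<SegmentStat> segs = new ArrayList<>();
        segs.add(new SegmentStat("_a", 1000L, 100, 50)); // half of the docs deleted
        segs.add(new SegmentStat("_b", 400L, 100, 0));
        segs.add(new SegmentStat("_c", 300L, 100, 0));
        for (SegmentStat s : pickForMerge(segs, 800L)) {
            System.out.println(s.name + " ~" + estimatedLiveBytes(s) + " bytes live");
        }
    }
}
```

Note that a segment with many deletes counts for less than its
on-disk size here, so it becomes a merge candidate sooner than it
would if only raw bytes were compared.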

> is rather different from Lucene benchmark as we are testing
> high updates in a realtime environment

Lucene's benchmark allows this. NearRealtimeReaderTask is a good
place to start.

On Mon, Sep 21, 2009 at 4:50 PM, John Wang <john.wang@gmail.com> wrote:
> Jason:
>
>    Before jumping to any conclusions, let me describe the test setup. It
> is rather different from Lucene benchmark as we are testing high updates in
> a realtime environment:
>
>    We took a public corpus, medline, indexed to approximately 3 million
> docs, and updated all the docs over and over again for a 10 hour duration.
>
>    The only differences in the code used were the different MergePolicy
> settings that were applied.
>
>    Taking the variable of HW/OS out of the equation, let's ignore the
> absolute numbers and compare the relative numbers between the two runs.
>
>    The spike is due to merging of a large segment when we accumulate. The
> graph/perf numbers fit our hypothesis that the default MergePolicy chooses
> to merge small segments before large ones and does not handle segments with
> a high number of deletes well.
>
>     Merging is BOTH IO and CPU intensive, especially large merges.
>
>     I think the wiki explains it pretty well.
>
>     What you are saying is true with IO cache w.r.t. merge. Every time new
> files are created, old files in the IO cache are invalidated. As the
> experiment shows, this is detrimental to query performance when large
> segments are being merged.
>
>     "As we move to a sharded model of indexes, large merges will
> naturally not occur." Our test is on a 3 million document index, not very
> large for a single shard. Some katta people have run it on a much much
> larger index per shard. Saying large merges will not occur on indexes of
> this size IMHO is unfounded.
>
> -John
>
> On Tue, Sep 22, 2009 at 2:34 AM, Jason Rutherglen
> <jason.rutherglen@gmail.com> wrote:
>>
>> John,
>>
>> It would be great if Lucene's benchmark were used so everyone
>> could execute the test in their own environment and verify. It's
>> not clear what settings or code were used to generate the
>> results, so it's difficult to draw any reliable conclusions.
>>
>> The steep spike shows greater evidence for the IO cache being
>> cleared during large merges resulting in search performance
>> degradation. See:
>> http://www.lucidimagination.com/search/?q=madvise
>>
>> Merging is IO intensive and less CPU intensive, though if the
>> ConcurrentMergeScheduler is used, which defaults to 3 threads,
>> the CPU could be maxed out. Using a single thread on
>> synchronous spinning magnetic media seems more logical. Queries
>> are usually the inverse: CPU intensive, not IO intensive, when
>> the index is in the IO cache. After merging a large segment (or
>> during), queries would start hitting disk, and the results
>> clearly show that. The queries are suddenly more time consuming
>> as they seek on disk at a time when IO activity is at its peak
>> from merging large segments. Using madvise would prevent in-use
>> index files from being pushed out of the IO cache during a
>> merge, so query performance would continue unabated.
>>
>> As we move to a sharded model of indexes, large merges will
>> naturally not occur. Shards will reach a specified size and new
>> documents will be sent to new shards.
>>
>> -J
>>
>> On Sun, Sep 20, 2009 at 11:12 PM, John Wang <john.wang@gmail.com> wrote:
>> > The current default Lucene MergePolicy does not handle frequent updates
>> > well.
>> >
>> > We have done some performance analysis with that and a custom merge
>> > policy:
>> >
>> > http://code.google.com/p/zoie/wiki/ZoieMergePolicy
>> >
>> > -John
>> >
>> > On Mon, Sep 21, 2009 at 1:08 PM, Jason Rutherglen <
>> > jason.rutherglen@gmail.com> wrote:
>> >
>> >> I opened SOLR-1447 for this
>> >>
>> >> 2009/9/18 Noble Paul നോബിള്‍ नोब्ळ् <noble.paul@corp.aol.com>:
>> >> > We can use a simple reflection based implementation to simplify
>> >> > reading too many parameters.
>> >> >
>> >> > What I wish to emphasize is that Solr should be agnostic of xml
>> >> > altogether. It should only be aware of specific Objects and
>> >> > interfaces. If users wish to plug in something else in some other
>> >> > way, it should be fine
>> >> >
>> >> >
>> >> > There is a huge amount of learning invested in the current
>> >> > solrconfig.xml. Let us not make people throw that away.
>> >> >
>> >> > On Sat, Sep 19, 2009 at 1:59 AM, Jason Rutherglen
>> >> > <jason.rutherglen@gmail.com> wrote:
>> >> >> Over the weekend I may write a patch to allow simple reflection
>> >> >> based
>> >> >> injection from within solrconfig.
>> >> >>
>> >> >> On Fri, Sep 18, 2009 at 8:10 AM, Yonik Seeley
>> >> >> <yonik@lucidimagination.com> wrote:
>> >> >>> On Thu, Sep 17, 2009 at 4:30 PM, Shalin Shekhar Mangar
>> >> >>> <shalinmangar@gmail.com> wrote:
>> >> >>>>> I was wondering if there is a way I can modify
>> >> >>>>> calibrateSizeByDeletes just by configuration ?
>> >> >>>>>
>> >> >>>>
>> >> >>>> Alas, no. The only option that I see for you is to sub-class
>> >> >>>> LogByteSizeMergePolicy and set calibrateSizeByDeletes to true in
>> >> >>>> the constructor. However, please open a Jira issue so we don't
>> >> >>>> forget about it.
>> >> >>>
>> >> >>> It's the continuing stuff like this that makes me feel like we
>> >> >>> should
>> >> >>> be Spring (or equivalent) based someday... I'm just not sure how
>> >> >>> we're
>> >> >>> going to get there.
>> >> >>>
>> >> >>> -Yonik
>> >> >>> http://www.lucidimagination.com
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > -----------------------------------------------------
>> >> > Noble Paul | Principal Engineer| AOL | http://aol.com
>> >> >
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
