Date: Mon, 6 Jun 2011 11:59:18 -0400
Subject: Re: Solr performance tuning - disk i/o?
From: Erick Erickson <erickerickson@gmail.com>
To: solr-user@lucene.apache.org

The polling interval was in reference to slaves in a multi-machine
master/slave setup, so it's probably not a concern just at present.
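
For reference, here is a minimal sketch of where the polling interval
lives in a slave's solrconfig.xml, assuming the Java-based
ReplicationHandler that ships with Solr 1.4. The master host name is a
placeholder, and the five-minute interval is an arbitrary illustration,
not a recommendation:

```xml
<!-- Slave-side replication settings (sketch; host and interval are assumptions) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- Placeholder URL: point this at the master's replication handler -->
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <!-- HH:mm:ss; keep this longer than the slave's warmup time -->
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```

A single-server install has no such block, which would explain why
nothing about polling shows up on your statistics page.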

A warmup time of 0 is not particularly normal. I'm not quite sure what's
going on there, but you may want to look at the firstSearcher and
newSearcher listeners and the autowarmCount parameters on the caches in
solrconfig.xml.
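
As a rough sketch of where those settings live (all inside the <query>
section of solrconfig.xml): the cache sizes, the autowarmCount values and
the sort field below are illustrative assumptions, not values from your
config, and the facet fields are just two of the ones you listed.

```xml
<query>
  <!-- autowarmCount: how many entries from the old searcher's cache are
       replayed into the new searcher's cache. Numbers are placeholders. -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
               autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="32"/>

  <!-- Fired after each commit, before the new searcher serves requests. -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">10</str>
        <str name="facet">true</str>
        <str name="facet.field">format</str>
        <str name="facet.field">language</str>
        <!-- Assumption: publishDate is single-valued and you sort on it -->
        <str name="sort">publishDate desc</str>
      </lst>
    </arr>
  </listener>

  <!-- Fired once at startup, when there is no old searcher to autowarm from. -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">10</str>
        <str name="facet">true</str>
        <str name="facet.field">format</str>
        <str name="facet.field">language</str>
        <str name="sort">publishDate desc</str>
      </lst>
    </arr>
  </listener>

  <!-- false: requests wait for the first searcher to finish warming -->
  <useColdSearcher>false</useColdSearcher>
</query>
```

If the listeners are present and the warmup times still read 0, it may
be worth checking that the queries in them actually match documents.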

Best
Erick

On Mon, Jun 6, 2011 at 9:08 AM, Demian Katz <demian.katz@villanova.edu> wrote:
> Thanks once again for the helpful suggestions!
>
> Regarding the selection of facet fields, I think publishDate (which is
> actually just a year) and callnumber-first (which is actually a very
> broad, high-level category) are okay. authorStr is an interesting
> problem: it's definitely a useful facet (when a user searches for an
> author, odds are good that they want the one who published the most
> books... i.e. a search for dickens will probably show Charles Dickens at
> the top of the facet list), but it has a long tail since there are many
> minor authors who have only published one or two books... Is there a
> possibility that the facet.mincount parameter could be helpful here, or
> does that have no impact on performance/memory footprint?
>
> Regarding polling interval for slaves, are you referring to a
> distributed Solr environment, or is this something to do with Solr's
> internals? We're currently a single-server environment, so I don't think
> I have to worry if it's related to a multi-server setup... but if it's
> something internal, could you point me to the right area of the admin
> panel to check my stats? I'm not seeing anything about polling on the
> statistics page. It's also a little strange that all of my warmupTime
> stats on searchers and caches are showing as 0 -- is that normal?
>
> thanks,
> Demian
>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> Sent: Friday, June 03, 2011 4:45 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr performance tuning - disk i/o?
>>
>> Quick impressions:
>>
>> Faceting is usually best done on fields that don't have lots of unique
>> values, for three reasons:
>> 1> It's questionable how much use it is to the user to have a gazillion
>>      facets. In the case of a unique field per document, in fact, it's
>>      useless.
>> 2> Resource requirements go up as a function of the number of unique
>>      terms. This is true for faceting and sorting.
>> 3> Warmup times grow the more terms have to be read into memory.
>>
>>
>> Glancing at your warmup stuff, things like publishDate, authorStr and
>> maybe callnumber-first are questionable. publishDate depends on how
>> coarse the resolution is. If it's by day, that's not really much use.
>> authorStr... How many authors have more than one publication? Would
>> this be better served by some kind of autosuggest rather than facets?
>> callnumber-first... I don't really know, but if it's unique per
>> document it's probably not something the user would find useful as a
>> facet.
>>
>> The admin page will help you determine the number of unique terms per
>> field, which may guide you on whether or not to continue to facet on
>> these fields.
>>
>> As Otis said, doing a sort on the fields during warmup will also help.
>>
>> Watch your polling interval for any slaves in relation to the warmup
>> times. If your polling interval is shorter than the warmup times, you
>> run a risk of "runaway warmups".
>>
>> As you've figured out, measuring responses to the first few queries
>> doesn't always measure what you really need <G>...
>>
>> I don't have the pages handy, but autowarming is a good topic to
>> understand, so you might spend some time tracking it down.
>>
>> Best
>> Erick
>>
>> On Fri, Jun 3, 2011 at 11:21 AM, Demian Katz <demian.katz@villanova.edu> wrote:
>> > Thanks to you and Otis for the suggestions! Some more information:
>> >
>> > - Based on the Solr stats page, my caches seem to be working pretty
>> >   well (few or no evictions, hit rates in the 75-80% range).
>> > - VuFind is actually doing two Solr queries per search (one initial
>> >   search followed by a supplemental spell check search -- I believe
>> >   this is necessary because VuFind has two separate spelling indexes,
>> >   one for shingled terms and one for single words). That is probably
>> >   exaggerating the problem, though based on searches with debugQuery
>> >   on, it looks like it's always the initial search (rather than the
>> >   supplemental spelling search) that's consuming the bulk of the time.
>> > - enableLazyFieldLoading is set to true.
>> > - I'm retrieving 20 documents per page.
>> > - My JVM settings: -server
>> >   -Xloggc:/usr/local/vufind/solr/jetty/logs/gc.log -Xms4096m -Xmx4096m
>> >   -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=5
>> >
>> > It appears that a large portion of my problem had to do with
>> > autowarming, a topic that I've never had a strong grasp on, though
>> > perhaps I'm finally learning (any recommended primer links would be
>> > welcome!). I did have some autowarming settings in solrconfig.xml (an
>> > arbitrary search for a bunch of random keywords in the newSearcher and
>> > firstSearcher events, plus autowarmCount settings on all of my
>> > caches). However, when I looked at the debugQuery output, I noticed
>> > that a huge amount of time was being wasted loading facets on the
>> > first search after restarting Solr, so I changed my newSearcher and
>> > firstSearcher events to this:
>> >
>> >      <arr name="queries">
>> >        <lst>
>> >          <str name="q">*:*</str>
>> >          <str name="start">0</str>
>> >          <str name="rows">10</str>
>> >          <str name="facet">true</str>
>> >          <str name="facet.mincount">1</str>
>> >          <str name="facet.field">collection</str>
>> >          <str name="facet.field">format</str>
>> >          <str name="facet.field">publishDate</str>
>> >          <str name="facet.field">callnumber-first</str>
>> >          <str name="facet.field">topic_facet</str>
>> >          <str name="facet.field">authorStr</str>
>> >          <str name="facet.field">language</str>
>> >          <str name="facet.field">genre_facet</str>
>> >          <str name="facet.field">era_facet</str>
>> >          <str name="facet.field">geographic_facet</str>
>> >        </lst>
>> >      </arr>
>> >
>> > Overall performance has now increased dramatically, and now the
>> > biggest bottleneck in the debug output seems to be the shingle spell
>> > checking!
>> >
>> > Any other suggestions are welcome, since I suspect there's still room
>> > to squeeze more performance out of the system, and I'm still not sure
>> > I'm making the most of autowarming... but this seems like a big step
>> > in the right direction. Thanks again for the help!
>> >
>> > - Demian
>> >
>> >> -----Original Message-----
>> >> From: Erick Erickson [mailto:erickerickson@gmail.com]
>> >> Sent: Friday, June 03, 2011 9:41 AM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Solr performance tuning - disk i/o?
>> >>
>> >> This doesn't seem right. Here are a couple of things to try:
>> >> 1> Attach &debugQuery=on to your long-running queries. The QTime
>> >>      returned is the time taken to search, NOT including the time to
>> >>      load the docs. That'll help pinpoint whether the problem is the
>> >>      search itself, or assembling the documents.
>> >> 2> Are you autowarming? If so, be sure it's actually done before
>> >>      querying.
>> >> 3> Measure queries after the first few, particularly if you're
>> >>      sorting or faceting.
>> >> 4> What are your JVM settings? How much memory do you have?
>> >> 5> Is <enableLazyFieldLoading> set to true in your solrconfig.xml?
>> >> 6> How many docs are you returning?
>> >>
>> >>
>> >> There's more, but that'll do for a start... Let us know if you
>> >> gather more data and it's still slow.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Fri, Jun 3, 2011 at 8:44 AM, Demian Katz <demian.katz@villanova.edu> wrote:
>> >> > Hello,
>> >> >
>> >> > I'm trying to move a VuFind installation from an ailing physical
>> >> > server into a virtualized environment, and I'm running into
>> >> > performance problems. VuFind is a Solr 1.4.1-based application with
>> >> > fairly large and complex records (many stored fields, many words
>> >> > per record). My particular installation contains about a million
>> >> > records in the index, with a total index size around 6GB.
>> >> >
>> >> > The virtual environment has more RAM and better CPUs than the old
>> >> > physical box, and I am satisfied that my Java environment is
>> >> > well-tuned. My index is optimized. Searches that hit the cache
>> >> > respond very well. The problem is that non-cached searches are very
>> >> > slow - the more keywords I add, the slower they get, to the point
>> >> > of taking 6-12 seconds to come back with results on a quiet box and
>> >> > well over a minute under stress testing. (The old box still took a
>> >> > while for equivalent searches, but it was about twice as fast as
>> >> > the new one).
>> >> >
>> >> > My gut feeling is that disk access reading the index is the
>> >> > bottleneck here, but I know little about the specifics of Solr's
>> >> > internals, so it's entirely possible that my gut is wrong. Outside
>> >> > testing does show that the virtual environment's disk performance
>> >> > is not as good as the old physical server, especially when multiple
>> >> > processes are trying to access the same file simultaneously.
>> >> >
>> >> > So, two basic questions:
>> >> >
>> >> >
>> >> > 1.) Would you agree that I'm dealing with a disk bottleneck, or are
>> >> >     there some other factors I should be considering? Any good
>> >> >     diagnostics I should be looking at?
>> >> >
>> >> > 2.) If the problem is disk access, is there anything I can tune on
>> >> >     the Solr side to alleviate the problems?
>> >> >
>> >> > Thanks,
>> >> > Demian
>> >> >
>> >
>