Subject: Re: Uneven shard heap usage
From: Joe Gresock
To: solr-user@lucene.apache.org
Date: Mon, 2 Jun 2014 06:09:33 -0400

So, we're definitely running into some very large documents (180MB, for example). I haven't run the analysis on the other 2 shards yet, but this could definitely be our problem.

Is there any conventional wisdom on a good "maximum size" for your indexed fields? Of course it will vary for each system, but assuming a heap of 10g, does anyone have past experience in limiting their field sizes?

Our caches are set to 128.
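One thing we're considering is a guard on the ingest side that caps oversized fields before they ever reach Solr. A rough, untested sketch of the idea -- the field handling, the 1MB cap, the core name, and the *_original_bytes field are placeholders, not our real pipeline:

import json
import requests

# Placeholder values -- adjust for the real collection and schema.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"
MAX_FIELD_BYTES = 1 * 1024 * 1024  # arbitrary 1MB cap per stored text field


def cap_large_fields(doc):
    """Truncate any string field larger than the cap, recording the original size."""
    capped = {}
    for name, value in doc.items():
        if isinstance(value, str):
            raw = value.encode("utf-8")
            if len(raw) > MAX_FIELD_BYTES:
                # Assumes a matching *_original_bytes dynamic field in the schema.
                capped[name + "_original_bytes"] = len(raw)
                value = raw[:MAX_FIELD_BYTES].decode("utf-8", "ignore")
        capped[name] = value
    return capped


def index_docs(docs):
    """Send a batch of documents to Solr's JSON update handler."""
    payload = [cap_large_fields(d) for d in docs]
    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commit": "false"},
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

Truncating stored content obviously isn't free, which is why I'd still like to hear what limits other people have found workable in practice.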

On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock wrote:

> These are some good ideas. The "huge document" idea could add up, since I think the shard1 index is a little larger (32.5GB on disk instead of 31.9GB), so it is possible there are one or two really big documents getting loaded into memory there.
>
> Btw, I did find an article on Solr document routing (http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I don't think that our ID structure is a problem in itself. But I will follow up on the large document idea.
>
> I used this article (https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips) to find the index heap and disk usage:
> http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
>
> Though looking at the data index directory on disk basically said the same thing.
>
> I am pretty sure we're using the smart round-robining client, but I will double check on Monday.
>
> We have been using CollectD and graphite to monitor our VMs, as well as jvisualvm, though we haven't tried SPM.
>
> Thanks for all the ideas, guys.
>
>
> On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic < otis.gospodnetic@gmail.com> wrote:
>
> > Hi Joe,
> >
> > Are you sure all 3 shards are roughly the same size, and how do you know? Can you share what you run/see that shows you that?
> >
> > Are you sure queries are evenly distributed? Something like SPM should give you insight into that.
> >
> > How big are your caches?
> >
> > Otis
> > --
> > Performance Monitoring * Log Analytics * Search Analytics
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> > On Sat, May 31, 2014 at 5:54 PM, Joe Gresock wrote:
> >
> > > Interesting thought about the routing. Our document ids are in 3 parts:
> > >
> > > <10-digit identifier>!<timestamp>!<type>
> > >
> > > e.g., 5/12345678!130000025603!TEXT
> > >
> > > Each object has an identifier, and there may be multiple versions of the object, hence the timestamp. We like to be able to pull back all of the versions of an object at once, hence the routing scheme.
> > >
> > > The nature of the identifier is that a great many of them begin with a certain number. I'd be interested to know more about the hashing scheme used for the document routing. Perhaps the first character gives it more weight as to which shard it lands in?
> > >
> > > It seems strange that certain of the most highly-searched documents would happen to fall on this shard, but you may be onto something. We'll scrape through some non-distributed queries and see what we can find.
> > >
> > >
> > > On Sat, May 31, 2014 at 1:47 PM, Erick Erickson < erickerickson@gmail.com> wrote:
> > >
> > > > This is very weird.
> > > >
> > > > Are you sure that all the Java versions are identical? And all the JVM parameters are the same? Grasping at straws here.
> > > >
> > > > More grasping at straws: I'm a little suspicious that you are using routing. You say that the indexes are about the same size, but is it possible that your routing is somehow loading the problem shard abnormally? By that I mean somehow the documents on that shard are different, or have a drastically higher number of hits than the other shards?
> > > >
> > > > You can fire queries at shards with &distrib=false and NOT have them go to other shards; isolating the problem queries that way might shed some light on the problem.
> > > >
> > > > Best
> > > > Erick@Baffled.com
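That &distrib=false suggestion is how we'll do the spot checks on the shard1 replicas. Something along these lines, untested -- the host and core names are placeholders, not our real ones:

import requests

# One URL per shard1 replica core; distrib=false keeps the query on that core
# instead of fanning out across the cluster.
SHARD1_CORES = [
    "http://solr-host-1:8983/solr/collection1_shard1_replica1",
    "http://solr-host-2:8983/solr/collection1_shard1_replica2",
    "http://solr-host-3:8983/solr/collection1_shard1_replica3",
]


def spot_check(query):
    """Print hit count and query time for the same query on each shard1 replica."""
    for core_url in SHARD1_CORES:
        resp = requests.get(
            core_url + "/select",
            params={"q": query, "distrib": "false", "rows": 0, "wt": "json"},
        )
        body = resp.json()
        print(core_url,
              "numFound:", body["response"]["numFound"],
              "QTime:", body["responseHeader"]["QTime"])


if __name__ == "__main__":
    spot_check("*:*")  # swap in the suspect queries from our logs

If particular queries consistently come back with outsized hit counts or QTime on those cores, that should narrow things down.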
> > > >
> > > > On Sat, May 31, 2014 at 8:33 AM, Joe Gresock wrote:
> > > >
> > > > > The last time we tried, it took as little as 2 minutes to happen. It basically happens upon high query load (peak user hours during the day). When we reduce functionality by disabling most searches, it stabilizes. So it really is only on high query load. Our ingest rate is fairly low.
> > > > >
> > > > > It happens no matter how many nodes in the shard are up.
> > > > >
> > > > > Joe
> > > > >
> > > > >
> > > > > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky < jack@basetechnology.com> wrote:
> > > > >
> > > > > > When you restart, how long does it take to hit the problem? And how much query or update activity is happening in that time? Is there any other activity showing up in the log?
> > > > > >
> > > > > > If you bring up only a single node in that problematic shard, do you still see the problem?
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Joe Gresock
> > > > > > Sent: Saturday, May 31, 2014 9:34 AM
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Uneven shard heap usage
> > > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > I'm trying to figure out why one shard of an evenly-distributed 3-shard cluster would suddenly start running out of heap space, after 9+ months of stable performance. We're using the "!" delimiter in our ids to distribute the documents, and indeed the disk sizes of our shards are very similar (31-32GB on disk per replica).
> > > > > >
> > > > > > Our setup is:
> > > > > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so basically 2 physical CPUs), 24GB disk
> > > > > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever). We reserve 10g heap for each Solr instance.
> > > > > > Also 3 zookeeper VMs, which are very stable
> > > > > >
> > > > > > Since the troubles started, we've been monitoring all 9 with jvisualvm, and shards 2 and 3 keep a steady amount of heap space reserved, always showing horizontal lines (with some minor gc). They're using 4-5GB heap, and when we force gc using jvisualvm, they drop to 1GB usage. Shard 1, however, quickly shows a steep slope, and eventually has concurrent mode failures in the gc logs, requiring us to restart the instances when they can no longer do anything but gc.
> > > > > >
> > > > > > We've tried ruling out physical host problems by moving all 3 Shard 1 replicas to different hosts that are underutilized, but we still get the same problem. We'll keep working on ruling out infrastructure issues, but I wanted to ask these questions here in the meantime, in case they make sense to anyone:
> > > > > >
> > > > > > * Does it make sense that all the replicas on one shard of a cluster would have heap problems, when the other shard replicas do not, assuming a fairly even data distribution?
> > > > > > * One thing we changed recently was to make all of our fields stored, instead of only half of them. This was to support atomic updates. Can stored fields, even though lazily loaded, cause problems like this?
> > > > > >
> > > > > > Thanks for any input,
> > > > > > Joe
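On the index sizes quoted above: the core admin STATUS call from the DataStax article is easy to loop over all the nodes, which is roughly how we've been comparing doc counts and on-disk size per core (the node URLs here are placeholders for our hosts):

import requests

# One base URL per Solr node in the cluster (placeholders).
NODES = [
    "http://solr-host-1:8983/solr",
    "http://solr-host-2:8983/solr",
    "http://solr-host-3:8983/solr",
]

for node in NODES:
    resp = requests.get(
        node + "/admin/cores",
        params={"action": "STATUS", "memory": "true", "wt": "json"},
    )
    resp.raise_for_status()
    # The STATUS response maps core name -> details, including index size/docs.
    for core_name, core in sorted(resp.json()["status"].items()):
        index = core.get("index", {})
        print(node, core_name,
              "docs:", index.get("numDocs"),
              "size:", index.get("size"))

The numbers it reports line up with the data directory sizes on disk, so the data distribution itself still looks even to us.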
--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*