From: Michael Sokolov <msokolov@safaribooksonline.com>
Date: Mon, 02 Jun 2014 07:27:46 -0400
To: solr-user@lucene.apache.org
Subject: Re: Uneven shard heap usage

Joe - there shouldn't really be a problem *indexing* these fields: remember
that all the terms are spread across the index, so there is really no storage
difference between one 180MB document and 180 1MB documents from an indexing
perspective.

Making the field "stored" is more likely to lead to a problem, although it's
still a bit of a mystery exactly what's going on. Do they need to be stored?
For example: do you highlight the entire field?

Still, 180MB shouldn't necessarily lead to heap space problems, but one thing
you could play with is reducing the cache sizes on that node: if you had very
large (in terms of numbers of documents) caches, and a lot of the documents
were big, that could lead to heap problems. But this is all just guessing.

-Mike

On 6/2/2014 6:13 AM, Joe Gresock wrote:
> And the follow-up question would be: if some of these documents are
> legitimately this large (they really do have that much text), is there a
> good way to still allow that to be searchable and not explode our index?
> These would be "text_en" type fields.
>
> On Mon, Jun 2, 2014 at 6:09 AM, Joe Gresock wrote:
>
>> So, we're definitely running into some very large documents (180MB, for
>> example). I haven't run the analysis on the other 2 shards yet, but this
>> could definitely be our problem.
>>
>> Is there any conventional wisdom on a good "maximum size" for your
>> indexed fields? Of course it will vary for each system, but assuming a
>> heap of 10g, does anyone have past experience in limiting their field
>> sizes?
>>
>> Our caches are set to 128.
>>
>> On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock wrote:
>>
>>> These are some good ideas. The "huge document" idea could add up, since
>>> I think the shard1 index is a little larger (32.5GB on disk instead of
>>> 31.9GB), so it is possible there are one or two really big ones that
>>> are getting loaded into memory there.
>>>
>>> Btw, I did find an article on the Solr document routing
>>> (http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I
>>> don't think that our ID structure is a problem in itself. But I will
>>> follow up on the large document idea.
>>>
>>> I used this article
>>> (https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips)
>>> to find the index heap and disk usage:
>>> http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
>>>
>>> Though looking at the data index directory on disk basically said the
>>> same thing.
>>>
>>> I am pretty sure we're using the smart round-robining client, but I
>>> will double check on Monday.
>>>
>>> We have been using CollectD and Graphite to monitor our VMs, as well as
>>> jvisualvm, though we haven't tried SPM.
>>>
>>> Thanks for all the ideas, guys.
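The admin cores STATUS call quoted above lends itself to a quick comparison
script across replicas. A minimal sketch, assuming Python with the requests
library, placeholder host names, and the JSON response format (wt=json); it
only reads the sizeInBytes/numDocs/maxDoc entries of the per-core "index"
section that the STATUS response normally includes:

```python
import requests

# Placeholder hosts; substitute one replica per shard (or all nine nodes).
HOSTS = ["shard1-host:8983", "shard2-host:8983", "shard3-host:8983"]

for host in HOSTS:
    resp = requests.get(
        "http://%s/solr/admin/cores" % host,
        params={"action": "STATUS", "memory": "true", "wt": "json"},
    ).json()
    # "status" maps each core name on the node to its status, including an
    # "index" section with on-disk size and document counts.
    for core, info in sorted(resp["status"].items()):
        idx = info.get("index", {})
        print("%s %s sizeInBytes=%s numDocs=%s maxDoc=%s" % (
            host, core, idx.get("sizeInBytes"),
            idx.get("numDocs"), idx.get("maxDoc")))
```

Comparing those numbers across the three shards is a cheap way to confirm
whether shard1 really holds more, or much larger, documents than the others.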
>>>
>>> On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic <
>>> otis.gospodnetic@gmail.com> wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> Are you sure (and how are you sure) that all 3 shards are roughly the
>>>> same size? Can you share what you run/see that shows you that?
>>>>
>>>> Are you sure queries are evenly distributed? Something like SPM should
>>>> give you insight into that.
>>>>
>>>> How big are your caches?
>>>>
>>>> Otis
>>>> --
>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>> On Sat, May 31, 2014 at 5:54 PM, Joe Gresock wrote:
>>>>
>>>>> Interesting thought about the routing. Our document ids are in 3
>>>>> parts: <10-digit identifier>!<timestamp>!<type>
>>>>>
>>>>> e.g., 5/12345678!130000025603!TEXT
>>>>>
>>>>> Each object has an identifier, and there may be multiple versions of
>>>>> the object, hence the timestamp. We like to be able to pull back all
>>>>> of the versions of an object at once, hence the routing scheme.
>>>>>
>>>>> The nature of the identifier is that a great many of them begin with
>>>>> a certain number. I'd be interested to know more about the hashing
>>>>> scheme used for the document routing. Perhaps the first character
>>>>> gives it more weight as to which shard it lands in?
>>>>>
>>>>> It seems strange that certain of the most highly-searched documents
>>>>> would happen to fall on this shard, but you may be onto something.
>>>>> We'll scrape through some non-distributed queries and see what we can
>>>>> find.
>>>>>
>>>>> On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <
>>>>> erickerickson@gmail.com> wrote:
>>>>>
>>>>>> This is very weird.
>>>>>>
>>>>>> Are you sure that all the Java versions are identical? And all the
>>>>>> JVM parameters are the same? Grasping at straws here.
>>>>>>
>>>>>> More grasping at straws: I'm a little suspicious that you are using
>>>>>> routing. You say that the indexes are about the same size, but is it
>>>>>> possible that your routing is somehow loading the problem shard
>>>>>> abnormally? By that I mean: are the documents on that shard somehow
>>>>>> different, or do they have a drastically higher number of hits than
>>>>>> the other shards?
>>>>>>
>>>>>> You can fire queries at shards with &distrib=false and NOT have them
>>>>>> go to other shards; perhaps if you can isolate the problem queries,
>>>>>> that might shed some light on the problem.
>>>>>>
>>>>>> Best
>>>>>> Erick@Baffled.com
>>>>>>
>>>>>> On Sat, May 31, 2014 at 8:33 AM, Joe Gresock wrote:
>>>>>>
>>>>>>> It has taken as little as 2 minutes to happen the last time we
>>>>>>> tried. It basically happens upon high query load (peak user hours
>>>>>>> during the day). When we reduce functionality by disabling most
>>>>>>> searches, it stabilizes. So it really is only on high query load.
>>>>>>> Our ingest rate is fairly low. It happens no matter how many nodes
>>>>>>> in the shard are up.
>>>>>>>
>>>>>>> Joe
>>>>>>>
>>>>>>> On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <
>>>>>>> jack@basetechnology.com> wrote:
>>>>>>>
>>>>>>>> When you restart, how long does it take to hit the problem? And
>>>>>>>> how much query or update activity is happening in that time? Is
>>>>>>>> there any other activity showing up in the log?
>>>>>>>>
>>>>>>>> If you bring up only a single node in that problematic shard, do
>>>>>>>> you still see the problem?
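Erick's &distrib=false suggestion is also easy to script. A minimal sketch
along the same lines, assuming Python with the requests library, placeholder
hosts, a placeholder core name, and a hypothetical suspect query pulled from
the logs; distrib=false keeps each request on the replica it is sent to:

```python
import requests

# Placeholder replica hosts and core name; substitute the real replicas.
REPLICAS = ["shard1-host:8983", "shard2-host:8983", "shard3-host:8983"]
CORE = "collection1"
QUERY = "body_text_en:example"   # hypothetical suspect query from the logs

for host in REPLICAS:
    url = "http://%s/solr/%s/select" % (host, CORE)
    params = {"q": QUERY, "distrib": "false", "rows": "0", "wt": "json"}
    data = requests.get(url, params=params).json()
    # Compare hit counts and query times per shard for the same query.
    print("%s  numFound=%d  QTime=%dms" % (
        host,
        data["response"]["numFound"],
        data["responseHeader"]["QTime"],
    ))
```

If shard1's replicas consistently report more hits or much higher QTime for
the same queries, that would support the theory that the routing has
concentrated the hot documents on that shard.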
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Joe Gresock
>>>>>>>> Sent: Saturday, May 31, 2014 9:34 AM
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Uneven shard heap usage
>>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> I'm trying to figure out why one shard of an evenly-distributed
>>>>>>>> 3-shard cluster would suddenly start running out of heap space,
>>>>>>>> after 9+ months of stable performance. We're using the "!"
>>>>>>>> delimiter in our ids to distribute the documents, and indeed the
>>>>>>>> disk sizes of our shards are very similar (31-32GB on disk per
>>>>>>>> replica).
>>>>>>>>
>>>>>>>> Our setup is:
>>>>>>>> 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio,
>>>>>>>> so basically 2 physical CPUs), 24GB disk
>>>>>>>> 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).
>>>>>>>> We reserve 10g heap for each Solr instance.
>>>>>>>> Also 3 ZooKeeper VMs, which are very stable
>>>>>>>>
>>>>>>>> Since the troubles started, we've been monitoring all 9 with
>>>>>>>> jvisualvm, and shards 2 and 3 keep a steady amount of heap space
>>>>>>>> reserved, always showing horizontal lines (with some minor gc).
>>>>>>>> They're using 4-5GB heap, and when we force gc using jvisualvm,
>>>>>>>> they drop to 1GB usage. Shard 1, however, quickly has a steep
>>>>>>>> slope, and eventually has concurrent mode failures in the gc
>>>>>>>> logs, requiring us to restart the instances when they can no
>>>>>>>> longer do anything but gc.
>>>>>>>>
>>>>>>>> We've tried ruling out physical host problems by moving all 3
>>>>>>>> shard 1 replicas to different hosts that are underutilized;
>>>>>>>> however, we still get the same problem. We'll still be working on
>>>>>>>> ruling out infrastructure issues, but I wanted to ask the
>>>>>>>> questions here in case it makes sense:
>>>>>>>>
>>>>>>>> * Does it make sense that all the replicas on one shard of a
>>>>>>>> cluster would have heap problems, when the other shard replicas
>>>>>>>> do not, assuming a fairly even data distribution?
>>>>>>>> * One thing we changed recently was to make all of our fields
>>>>>>>> stored, instead of only half of them. This was to support atomic
>>>>>>>> updates. Can stored fields, even though lazily loaded, cause
>>>>>>>> problems like this?
>>>>>>>>
>>>>>>>> Thanks for any input,
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> --
>>>>>>>> I know what it is to be in need, and I know what it is to have
>>>>>>>> plenty. I have learned the secret of being content in any and
>>>>>>>> every situation, whether well fed or hungry, whether living in
>>>>>>>> plenty or in want. I can do all this through him who gives me
>>>>>>>> strength. *-Philippians 4:12-13*
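On the stored-fields point: atomic updates do require essentially every field
to be stored, because Solr rebuilds the document from its stored values before
re-indexing it, which is why the recent schema change was made. A minimal
sketch of such an update, assuming Python with the requests library, the JSON
update syntax available in Solr 4.x, and placeholder host, core, and field
names (status_s is invented for illustration; the id reuses the format from
earlier in the thread):

```python
import json
import requests

# Placeholder host and core name.
URL = "http://shard1-host:8983/solr/collection1/update"

# Atomic ("partial") update: only status_s is set here, but Solr re-reads the
# document's stored fields to rebuild and re-index the full document.
docs = [{"id": "5/12345678!130000025603!TEXT", "status_s": {"set": "reviewed"}}]

resp = requests.post(
    URL,
    params={"commit": "true", "wt": "json"},
    data=json.dumps(docs),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json()["responseHeader"]["status"])
```

Storing everything does not by itself blow up the heap, but it makes anything
that materializes whole documents (highlighting, large fl lists, a big
documentCache) proportionally more expensive, which ties back to Mike's
cache-size suggestion at the top of the thread.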