Subject: Re: Uneven shard heap usage
From: Joe Gresock
To: solr-user@lucene.apache.org
Date: Mon, 2 Jun 2014 06:09:33 -0400

So, we're definitely running into some very large documents (180MB, for example). I haven't run the analysis on the other 2 shards yet, but this could definitely be our problem.

Is there any conventional wisdom on a good "maximum size" for your indexed fields? Of course it will vary for each system, but assuming a heap of 10g, does anyone have past experience in limiting their field sizes?

Our caches are set to 128.
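One thing we're considering is a guard on the ingest side that caps oversized fields before they ever reach Solr. A rough, untested sketch of the idea -- the field handling, the 1MB cap, the core name, and the *_original_bytes field are placeholders, not our real pipeline:

import json
import requests

# Placeholder values -- adjust for the real collection and schema.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"
MAX_FIELD_BYTES = 1 * 1024 * 1024  # arbitrary 1MB cap per stored text field


def cap_large_fields(doc):
    """Truncate any string field larger than the cap, recording the original size."""
    capped = {}
    for name, value in doc.items():
        if isinstance(value, str):
            raw = value.encode("utf-8")
            if len(raw) > MAX_FIELD_BYTES:
                # Assumes a matching *_original_bytes dynamic field in the schema.
                capped[name + "_original_bytes"] = len(raw)
                value = raw[:MAX_FIELD_BYTES].decode("utf-8", "ignore")
        capped[name] = value
    return capped


def index_docs(docs):
    """Send a batch of documents to Solr's JSON update handler."""
    payload = [cap_large_fields(d) for d in docs]
    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commit": "false"},
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

Truncating stored content obviously isn't free, which is why I'd still like to hear what limits other people have found workable in practice.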

On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock wrote:

> These are some good ideas. The "huge document" idea could add up, since I think the shard1 index is a little larger (32.5GB on disk instead of 31.9GB), so it is possible there are one or two really big documents getting loaded into memory there.
>
> Btw, I did find an article on Solr document routing (http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I don't think that our ID structure is a problem in itself. But I will follow up on the large document idea.
>
> I used this article (https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips) to find the index heap and disk usage:
> http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
>
> Though looking at the data index directory on disk basically said the same thing.
>
> I am pretty sure we're using the smart round-robining client, but I will double check on Monday.
>
> We have been using CollectD and graphite to monitor our VMs, as well as jvisualvm, though we haven't tried SPM.
>
> Thanks for all the ideas, guys.
>
>
> On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic < otis.gospodnetic@gmail.com> wrote:
>
> > Hi Joe,
> >
> > Are you sure all 3 shards are roughly the same size, and how do you know? Can you share what you run/see that shows you that?
> >
> > Are you sure queries are evenly distributed? Something like SPM should give you insight into that.
> >
> > How big are your caches?
> >
> > Otis
> > --
> > Performance Monitoring * Log Analytics * Search Analytics
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> > On Sat, May 31, 2014 at 5:54 PM, Joe Gresock wrote:
> >
> > > Interesting thought about the routing. Our document ids are in 3 parts:
> > >
> > > <10-digit identifier>!<timestamp>!<type>
> > >
> > > e.g., 5/12345678!130000025603!TEXT
> > >
> > > Each object has an identifier, and there may be multiple versions of the object, hence the timestamp. We like to be able to pull back all of the versions of an object at once, hence the routing scheme.
> > >
> > > The nature of the identifier is that a great many of them begin with a certain number. I'd be interested to know more about the hashing scheme used for the document routing. Perhaps the first character gives it more weight as to which shard it lands in?
> > >
> > > It seems strange that certain of the most highly-searched documents would happen to fall on this shard, but you may be onto something. We'll scrape through some non-distributed queries and see what we can find.
> > >
> > >
> > > On Sat, May 31, 2014 at 1:47 PM, Erick Erickson < erickerickson@gmail.com> wrote:
> > >
> > > > This is very weird.
> > > >
> > > > Are you sure that all the Java versions are identical? And all the JVM parameters are the same? Grasping at straws here.
> > > >
> > > > More grasping at straws: I'm a little suspicious that you are using routing. You say that the indexes are about the same size, but is it possible that your routing is somehow loading the problem shard abnormally? By that I mean somehow the documents on that shard are different, or have a drastically higher number of hits than the other shards?
> > > >
> > > > You can fire queries at shards with &distrib=false and NOT have them go to other shards; isolating the problem queries that way might shed some light on the problem.
> > > >
> > > > Best
> > > > Erick@Baffled.com
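That &distrib=false suggestion is how we'll do the spot checks on the shard1 replicas. Something along these lines, untested -- the host and core names are placeholders, not our real ones:

import requests

# One URL per shard1 replica core; distrib=false keeps the query on that core
# instead of fanning out across the cluster.
SHARD1_CORES = [
    "http://solr-host-1:8983/solr/collection1_shard1_replica1",
    "http://solr-host-2:8983/solr/collection1_shard1_replica2",
    "http://solr-host-3:8983/solr/collection1_shard1_replica3",
]


def spot_check(query):
    """Print hit count and query time for the same query on each shard1 replica."""
    for core_url in SHARD1_CORES:
        resp = requests.get(
            core_url + "/select",
            params={"q": query, "distrib": "false", "rows": 0, "wt": "json"},
        )
        body = resp.json()
        print(core_url,
              "numFound:", body["response"]["numFound"],
              "QTime:", body["responseHeader"]["QTime"])


if __name__ == "__main__":
    spot_check("*:*")  # swap in the suspect queries from our logs

If particular queries consistently come back with outsized hit counts or QTime on those cores, that should narrow things down.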
> > > >
> > > > On Sat, May 31, 2014 at 8:33 AM, Joe Gresock wrote:
> > > >
> > > > > The last time we tried, it took as little as 2 minutes to happen. It basically happens upon high query load (peak user hours during the day). When we reduce functionality by disabling most searches, it stabilizes. So it really is only on high query load. Our ingest rate is fairly low.
> > > > >
> > > > > It happens no matter how many nodes in the shard are up.
> > > > >
> > > > > Joe
> > > > >
> > > > >
> > > > > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky < jack@basetechnology.com> wrote:
> > > > >
> > > > > > When you restart, how long does it take to hit the problem? And how much query or update activity is happening in that time? Is there any other activity showing up in the log?
> > > > > >
> > > > > > If you bring up only a single node in that problematic shard, do you still see the problem?
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Joe Gresock
> > > > > > Sent: Saturday, May 31, 2014 9:34 AM
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Subject: Uneven shard heap usage
> > > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > I'm trying to figure out why one shard of an evenly-distributed 3-shard cluster would suddenly start running out of heap space, after 9+ months of stable performance. We're using the "!" delimiter in our ids to distribute the documents, and indeed the disk sizes of our shards are very similar (31-32GB on disk per replica).
> > > > > >
> > > > > > Our setup is:
> > > > > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so basically 2 physical CPUs), 24GB disk
> > > > > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever). We reserve 10g heap for each Solr instance.
> > > > > > Also 3 zookeeper VMs, which are very stable
> > > > > >
> > > > > > Since the troubles started, we've been monitoring all 9 with jvisualvm, and shards 2 and 3 keep a steady amount of heap space reserved, always showing horizontal lines (with some minor gc). They're using 4-5GB heap, and when we force gc using jvisualvm, they drop to 1GB usage. Shard 1, however, quickly shows a steep slope, and eventually has concurrent mode failures in the gc logs, requiring us to restart the instances when they can no longer do anything but gc.
> > > > > >
> > > > > > We've tried ruling out physical host problems by moving all 3 Shard 1 replicas to different hosts that are underutilized, but we still get the same problem. We'll keep working on ruling out infrastructure issues, but I wanted to ask these questions here in the meantime, in case they make sense to anyone:
> > > > > >
> > > > > > * Does it make sense that all the replicas on one shard of a cluster would have heap problems, when the other shard replicas do not, assuming a fairly even data distribution?
> > > > > > * One thing we changed recently was to make all of our fields stored, instead of only half of them. This was to support atomic updates. Can stored fields, even though lazily loaded, cause problems like this?
> > > > > >
> > > > > > Thanks for any input,
> > > > > > Joe
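On the index sizes quoted above: the core admin STATUS call from the DataStax article is easy to loop over all the nodes, which is roughly how we've been comparing doc counts and on-disk size per core (the node URLs here are placeholders for our hosts):

import requests

# One base URL per Solr node in the cluster (placeholders).
NODES = [
    "http://solr-host-1:8983/solr",
    "http://solr-host-2:8983/solr",
    "http://solr-host-3:8983/solr",
]

for node in NODES:
    resp = requests.get(
        node + "/admin/cores",
        params={"action": "STATUS", "memory": "true", "wt": "json"},
    )
    resp.raise_for_status()
    # The STATUS response maps core name -> details, including index size/docs.
    for core_name, core in sorted(resp.json()["status"].items()):
        index = core.get("index", {})
        print(node, core_name,
              "docs:", index.get("numDocs"),
              "size:", index.get("size"))

The numbers it reports line up with the data directory sizes on disk, so the data distribution itself still looks even to us.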
--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. *-Philippians 4:12-13*