Subject: Re: Uneven shard heap usage
From: Joe Gresock <jgresock@gmail.com>
To: solr-user@lucene.apache.org
Date: Sat, 31 May 2014 17:54:44 -0400

Interesting thought about the routing. Our document ids are in 3 parts, delimited by "!": a 10-digit identifier, a timestamp, and a suffix, e.g.:

  5/12345678!130000025603!TEXT

Each object has an identifier, and there may be multiple versions of the object, hence the timestamp. We like to be able to pull back all of the versions of an object at once, hence the routing scheme. The nature of the identifier is that a great many of them begin with a certain number.

I'd be interested to know more about the hashing scheme used for the document routing. Perhaps the first character gives it more weight as to which shard the document lands in? It seems strange that some of the most highly searched documents would happen to fall on this shard, but you may be onto something. We'll scrape through some non-distributed queries and see what we can find.
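As far as I understand, the default compositeId router hashes the route key (everything before the first "!") into the high bits of a 32-bit hash and divides that hash space into one contiguous range per shard, so every version of an object lands on the same shard. Here is a rough, illustrative sketch of that mapping; it is not Solr's actual code (Solr uses MurmurHash3, the CRC32 below is only a self-contained stand-in, it handles only the simple two-part case, and the shard count of 3 is assumed):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    // Conceptual sketch of composite-id routing; not Solr's implementation.
    public class RouteSketch {

        // Stand-in 32-bit hash (Solr itself uses MurmurHash3).
        static int hash32(String s) {
            CRC32 crc = new CRC32();
            crc.update(s.getBytes(StandardCharsets.UTF_8));
            return (int) crc.getValue();
        }

        // High 16 bits come from the route key (before the first '!'),
        // low 16 bits from the rest of the id.
        static int sliceHash(String id) {
            int bang = id.indexOf('!');
            String routeKey = bang < 0 ? id : id.substring(0, bang);
            String rest = bang < 0 ? "" : id.substring(bang + 1);
            return (hash32(routeKey) & 0xFFFF0000) | (hash32(rest) >>> 16);
        }

        // The 32-bit hash space is split into numShards contiguous ranges.
        static int shardFor(String id, int numShards) {
            long h = sliceHash(id) & 0xFFFFFFFFL;
            long rangeSize = (0xFFFFFFFFL / numShards) + 1;
            return (int) (h / rangeSize);
        }

        public static void main(String[] args) {
            // Both versions share the route key "5/12345678", so their hashes
            // share the same high 16 bits and normally map to the same shard.
            System.out.println(shardFor("5/12345678!130000025603!TEXT", 3));
            System.out.println(shardFor("5/12345678!130000099999!TEXT", 3));
        }
    }

If that picture is right, the first character isn't weighted specially; the whole route key is hashed. But a disproportionately popular route key (or group of keys) would concentrate both its documents and its query traffic on a single shard.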
On Sat, May 31, 2014 at 1:47 PM, Erick Erickson wrote:

> This is very weird.
>
> Are you sure that all the Java versions are identical? And all the JVM parameters are the same? Grasping at straws here.
>
> More grasping at straws: I'm a little suspicious that you are using routing. You say that the indexes are about the same size, but is it possible that your routing is somehow loading the problem shard abnormally? By that I mean somehow the documents on that shard are different, or have a drastically higher number of hits than the other shards?
>
> You can fire queries at shards with &distrib=false and NOT have them go to other shards; perhaps if you can isolate the problem queries, that might shed some light on the problem.
>
> Best
> Erick@Baffled.com
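To expand on the &distrib=false suggestion: pointing a query at a single core and adding distrib=false keeps it from fanning out to the other shards, so per-shard behavior can be compared directly. A minimal SolrJ sketch, where the host, core name, and query string are placeholders for your own:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SingleShardQuery {
        public static void main(String[] args) throws Exception {
            // Query one replica of shard1 directly; distrib=false prevents fan-out.
            HttpSolrServer core =
                new HttpSolrServer("http://solr-host-1:8983/solr/collection1_shard1_replica1");
            SolrQuery q = new SolrQuery("some_field:some_value");
            q.set("distrib", "false");
            QueryResponse rsp = core.query(q);
            System.out.println("hits on this core only: " + rsp.getResults().getNumFound());
            core.shutdown();
        }
    }

The same effect can be had with a plain HTTP request by appending &distrib=false to a query sent to a single core's /select handler.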
> On Sat, May 31, 2014 at 8:33 AM, Joe Gresock wrote:
>
> > It has taken as little as 2 minutes to happen the last time we tried. It basically happens upon high query load (peak user hours during the day). When we reduce functionality by disabling most searches, it stabilizes. So it really is only on high query load. Our ingest rate is fairly low.
> >
> > It happens no matter how many nodes in the shard are up.
> >
> > Joe
> >
> > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <jack@basetechnology.com> wrote:
> >
> > > When you restart, how long does it take to hit the problem? And how much query or update activity is happening in that time? Is there any other activity showing up in the log?
> > >
> > > If you bring up only a single node in that problematic shard, do you still see the problem?
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message-----
> > > From: Joe Gresock
> > > Sent: Saturday, May 31, 2014 9:34 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Uneven shard heap usage
> > >
> > > Hi folks,
> > >
> > > I'm trying to figure out why one shard of an evenly-distributed 3-shard cluster would suddenly start running out of heap space, after 9+ months of stable performance. We're using the "!" delimiter in our ids to distribute the documents, and indeed the disk sizes of our shards are very similar (31-32GB on disk per replica).
> > >
> > > Our setup is:
> > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so basically 2 physical CPUs), 24GB disk
> > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever). We reserve 10g heap for each Solr instance.
> > > Also 3 zookeeper VMs, which are very stable
> > >
> > > Since the troubles started, we've been monitoring all 9 with jvisualvm, and shards 2 and 3 keep a steady amount of heap space reserved, always showing horizontal lines (with some minor gc). They're using 4-5GB heap, and when we force gc using jvisualvm, they drop to 1GB usage. Shard 1, however, quickly shows a steep slope, and eventually has concurrent mode failures in the gc logs, requiring us to restart the instances when they can no longer do anything but gc.
> > >
> > > We've tried ruling out physical host problems by moving all 3 Shard 1 replicas to different hosts that are underutilized; however, we still get the same problem. We'll keep working on ruling out infrastructure issues, but I wanted to ask these questions here in case they make sense:
> > >
> > > * Does it make sense that all the replicas on one shard of a cluster would have heap problems, when the other shard replicas do not, assuming a fairly even data distribution?
> > > * One thing we changed recently was to make all of our fields stored, instead of only half of them. This was to support atomic updates (see the sketch at the end of this message). Can stored fields, even though lazily loaded, cause problems like this?
> > >
> > > Thanks for any input,
> > > Joe

--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength. -Philippians 4:12-13
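On the stored-fields point in the original question: atomic updates rely on the document's fields being stored (copyField destinations aside) so that Solr can rebuild the rest of the document server-side, which is presumably why everything was switched to stored. A minimal SolrJ sketch of an atomic update, where the URL, id, and field name are placeholders:

    import java.util.Collections;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AtomicUpdateSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder URL, id, and field name; adjust to your collection and schema.
            HttpSolrServer solr = new HttpSolrServer("http://solr-host-1:8983/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "5/12345678!130000025603!TEXT");
            // "set" replaces just this field; Solr reconstructs the remaining fields
            // from their stored values, which is why they all need to be stored.
            doc.addField("status_s", Collections.singletonMap("set", "archived"));

            solr.add(doc);
            solr.commit();
            solr.shutdown();
        }
    }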