From: Michael Sokolov <msokolov@safaribooksonline.com>
Date: Mon, 02 Jun 2014 07:27:46 -0400
To: solr-user@lucene.apache.org
Subject: Re: Uneven shard heap usage

Joe - there shouldn't really be a problem *indexing* these fields: remember
that all the terms are spread across the index, so there is really no storage
difference between one 180MB document and 180 1MB documents from an indexing
perspective.

Making the field "stored" is more likely to lead to a problem, although it's
still a bit of a mystery exactly what's going on. Do they need to be stored?
For example: do you highlight the entire field?

Still, 180MB shouldn't necessarily lead to heap space problems, but one thing
you could play with is reducing the cache sizes on that node: if you had very
large (in terms of numbers of documents) caches, and a lot of the documents
were big, that could lead to heap problems. But this is all just guessing.

-Mike

On 6/2/2014 6:13 AM, Joe Gresock wrote:
> And the follow-up question would be: if some of these documents are
> legitimately this large (they really do have that much text), is there a
> good way to still allow that to be searchable and not explode our index?
> These would be "text_en" type fields.
>
> On Mon, Jun 2, 2014 at 6:09 AM, Joe Gresock wrote:
>
>> So, we're definitely running into some very large documents (180MB, for
>> example). I haven't run the analysis on the other 2 shards yet, but this
>> could definitely be our problem.
>>
>> Is there any conventional wisdom on a good "maximum size" for your
>> indexed fields? Of course it will vary for each system, but assuming a
>> heap of 10g, does anyone have past experience in limiting their field
>> sizes?
>>
>> Our caches are set to 128.
>>
>> On Sun, Jun 1, 2014 at 8:32 AM, Joe Gresock wrote:
>>
>>> These are some good ideas. The "huge document" idea could add up, since
>>> I think the shard1 index is a little larger (32.5GB on disk instead of
>>> 31.9GB), so it is possible there are one or two really big ones that
>>> are getting loaded into memory there.
>>>
>>> Btw, I did find an article on the Solr document routing
>>> (http://searchhub.org/2013/06/13/solr-cloud-document-routing/), so I
>>> don't think that our ID structure is a problem in itself. But I will
>>> follow up on the large document idea.
>>>
>>> I used this article
>>> (https://support.datastax.com/entries/38367716-Solr-Configuration-Best-Practices-and-Troubleshooting-Tips)
>>> to find the index heap and disk usage:
>>> http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
>>>
>>> Though looking at the data index directory on disk basically said the
>>> same thing.
>>>
>>> I am pretty sure we're using the smart round-robining client, but I
>>> will double check on Monday.
>>>
>>> We have been using CollectD and Graphite to monitor our VMs, as well as
>>> jvisualvm, though we haven't tried SPM.
>>>
>>> Thanks for all the ideas, guys.
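The admin cores STATUS call quoted above lends itself to a quick comparison
script across replicas. A minimal sketch, assuming Python with the requests
library, placeholder host names, and the JSON response format (wt=json); it
only reads the sizeInBytes/numDocs/maxDoc entries of the per-core "index"
section that the STATUS response normally includes:

```python
import requests

# Placeholder hosts; substitute one replica per shard (or all nine nodes).
HOSTS = ["shard1-host:8983", "shard2-host:8983", "shard3-host:8983"]

for host in HOSTS:
    resp = requests.get(
        "http://%s/solr/admin/cores" % host,
        params={"action": "STATUS", "memory": "true", "wt": "json"},
    ).json()
    # "status" maps each core name on the node to its status, including an
    # "index" section with on-disk size and document counts.
    for core, info in sorted(resp["status"].items()):
        idx = info.get("index", {})
        print("%s %s sizeInBytes=%s numDocs=%s maxDoc=%s" % (
            host, core, idx.get("sizeInBytes"),
            idx.get("numDocs"), idx.get("maxDoc")))
```

Comparing those numbers across the three shards is a cheap way to confirm
whether shard1 really holds more, or much larger, documents than the others.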
>>>
>>> On Sat, May 31, 2014 at 11:54 PM, Otis Gospodnetic <
>>> otis.gospodnetic@gmail.com> wrote:
>>>
>>>> Hi Joe,
>>>>
>>>> Are you sure (and how are you sure) that all 3 shards are roughly the
>>>> same size? Can you share what you run/see that shows you that?
>>>>
>>>> Are you sure queries are evenly distributed? Something like SPM should
>>>> give you insight into that.
>>>>
>>>> How big are your caches?
>>>>
>>>> Otis
>>>> --
>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>> On Sat, May 31, 2014 at 5:54 PM, Joe Gresock wrote:
>>>>
>>>>> Interesting thought about the routing. Our document ids are in 3
>>>>> parts: <10-digit identifier>!<timestamp>!<type>
>>>>>
>>>>> e.g., 5/12345678!130000025603!TEXT
>>>>>
>>>>> Each object has an identifier, and there may be multiple versions of
>>>>> the object, hence the timestamp. We like to be able to pull back all
>>>>> of the versions of an object at once, hence the routing scheme.
>>>>>
>>>>> The nature of the identifier is that a great many of them begin with
>>>>> a certain number. I'd be interested to know more about the hashing
>>>>> scheme used for the document routing. Perhaps the first character
>>>>> gives it more weight as to which shard it lands in?
>>>>>
>>>>> It seems strange that certain of the most highly-searched documents
>>>>> would happen to fall on this shard, but you may be onto something.
>>>>> We'll scrape through some non-distributed queries and see what we can
>>>>> find.
>>>>>
>>>>> On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <
>>>>> erickerickson@gmail.com> wrote:
>>>>>
>>>>>> This is very weird.
>>>>>>
>>>>>> Are you sure that all the Java versions are identical? And all the
>>>>>> JVM parameters are the same? Grasping at straws here.
>>>>>>
>>>>>> More grasping at straws: I'm a little suspicious that you are using
>>>>>> routing. You say that the indexes are about the same size, but is it
>>>>>> possible that your routing is somehow loading the problem shard
>>>>>> abnormally? By that I mean: are the documents on that shard somehow
>>>>>> different, or do they have a drastically higher number of hits than
>>>>>> the other shards?
>>>>>>
>>>>>> You can fire queries at shards with &distrib=false and NOT have them
>>>>>> go to other shards; perhaps if you can isolate the problem queries,
>>>>>> that might shed some light on the problem.
>>>>>>
>>>>>> Best
>>>>>> Erick@Baffled.com
>>>>>>
>>>>>> On Sat, May 31, 2014 at 8:33 AM, Joe Gresock wrote:
>>>>>>
>>>>>>> It has taken as little as 2 minutes to happen the last time we
>>>>>>> tried. It basically happens upon high query load (peak user hours
>>>>>>> during the day). When we reduce functionality by disabling most
>>>>>>> searches, it stabilizes. So it really is only on high query load.
>>>>>>> Our ingest rate is fairly low. It happens no matter how many nodes
>>>>>>> in the shard are up.
>>>>>>>
>>>>>>> Joe
>>>>>>>
>>>>>>> On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <
>>>>>>> jack@basetechnology.com> wrote:
>>>>>>>
>>>>>>>> When you restart, how long does it take to hit the problem? And
>>>>>>>> how much query or update activity is happening in that time? Is
>>>>>>>> there any other activity showing up in the log?
>>>>>>>>
>>>>>>>> If you bring up only a single node in that problematic shard, do
>>>>>>>> you still see the problem?
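Erick's &distrib=false suggestion is also easy to script. A minimal sketch
along the same lines, assuming Python with the requests library, placeholder
hosts, a placeholder core name, and a hypothetical suspect query pulled from
the logs; distrib=false keeps each request on the replica it is sent to:

```python
import requests

# Placeholder replica hosts and core name; substitute the real replicas.
REPLICAS = ["shard1-host:8983", "shard2-host:8983", "shard3-host:8983"]
CORE = "collection1"
QUERY = "body_text_en:example"   # hypothetical suspect query from the logs

for host in REPLICAS:
    url = "http://%s/solr/%s/select" % (host, CORE)
    params = {"q": QUERY, "distrib": "false", "rows": "0", "wt": "json"}
    data = requests.get(url, params=params).json()
    # Compare hit counts and query times per shard for the same query.
    print("%s  numFound=%d  QTime=%dms" % (
        host,
        data["response"]["numFound"],
        data["responseHeader"]["QTime"],
    ))
```

If shard1's replicas consistently report more hits or much higher QTime for
the same queries, that would support the theory that the routing has
concentrated the hot documents on that shard.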
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Joe Gresock
>>>>>>>> Sent: Saturday, May 31, 2014 9:34 AM
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Uneven shard heap usage
>>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> I'm trying to figure out why one shard of an evenly-distributed
>>>>>>>> 3-shard cluster would suddenly start running out of heap space,
>>>>>>>> after 9+ months of stable performance. We're using the "!"
>>>>>>>> delimiter in our ids to distribute the documents, and indeed the
>>>>>>>> disk sizes of our shards are very similar (31-32GB on disk per
>>>>>>>> replica).
>>>>>>>>
>>>>>>>> Our setup is:
>>>>>>>> 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio,
>>>>>>>> so basically 2 physical CPUs), 24GB disk
>>>>>>>> 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).
>>>>>>>> We reserve 10g heap for each Solr instance.
>>>>>>>> Also 3 ZooKeeper VMs, which are very stable
>>>>>>>>
>>>>>>>> Since the troubles started, we've been monitoring all 9 with
>>>>>>>> jvisualvm, and shards 2 and 3 keep a steady amount of heap space
>>>>>>>> reserved, always showing horizontal lines (with some minor gc).
>>>>>>>> They're using 4-5GB heap, and when we force gc using jvisualvm,
>>>>>>>> they drop to 1GB usage. Shard 1, however, quickly has a steep
>>>>>>>> slope, and eventually has concurrent mode failures in the gc
>>>>>>>> logs, requiring us to restart the instances when they can no
>>>>>>>> longer do anything but gc.
>>>>>>>>
>>>>>>>> We've tried ruling out physical host problems by moving all 3
>>>>>>>> shard 1 replicas to different hosts that are underutilized;
>>>>>>>> however, we still get the same problem. We'll still be working on
>>>>>>>> ruling out infrastructure issues, but I wanted to ask the
>>>>>>>> questions here in case it makes sense:
>>>>>>>>
>>>>>>>> * Does it make sense that all the replicas on one shard of a
>>>>>>>> cluster would have heap problems, when the other shard replicas
>>>>>>>> do not, assuming a fairly even data distribution?
>>>>>>>> * One thing we changed recently was to make all of our fields
>>>>>>>> stored, instead of only half of them. This was to support atomic
>>>>>>>> updates. Can stored fields, even though lazily loaded, cause
>>>>>>>> problems like this?
>>>>>>>>
>>>>>>>> Thanks for any input,
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> --
>>>>>>>> I know what it is to be in need, and I know what it is to have
>>>>>>>> plenty. I have learned the secret of being content in any and
>>>>>>>> every situation, whether well fed or hungry, whether living in
>>>>>>>> plenty or in want. I can do all this through him who gives me
>>>>>>>> strength. *-Philippians 4:12-13*
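On the stored-fields point: atomic updates do require essentially every field
to be stored, because Solr rebuilds the document from its stored values before
re-indexing it, which is why the recent schema change was made. A minimal
sketch of such an update, assuming Python with the requests library, the JSON
update syntax available in Solr 4.x, and placeholder host, core, and field
names (status_s is invented for illustration; the id reuses the format from
earlier in the thread):

```python
import json
import requests

# Placeholder host and core name.
URL = "http://shard1-host:8983/solr/collection1/update"

# Atomic ("partial") update: only status_s is set here, but Solr re-reads the
# document's stored fields to rebuild and re-index the full document.
docs = [{"id": "5/12345678!130000025603!TEXT", "status_s": {"set": "reviewed"}}]

resp = requests.post(
    URL,
    params={"commit": "true", "wt": "json"},
    data=json.dumps(docs),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json()["responseHeader"]["status"])
```

Storing everything does not by itself blow up the heap, but it makes anything
that materializes whole documents (highlighting, large fl lists, a big
documentCache) proportionally more expensive, which ties back to Mike's
cache-size suggestion at the top of the thread.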