From: Oleksandr Shulgin
Date: Thu, 3 Nov 2016 15:46:18 +0100
Subject: Re: failing bootstraps with OOM
To: user@cassandra.apache.org

On Thu, Nov 3, 2016 at 2:32 PM, Mike Torra <mtorra@demandware.com> wrote:
> Hi Alex - I do monitor sstable counts and pending compactions, but
> probably not closely enough. In 3 of the 4 regions the cluster is
> running in, both counts are very high - ~30-40k sstables for one
> particular CF, and on many nodes >1k pending compactions.

It is generally a good idea to keep the number of pending compactions
minimal. We usually see it close to zero on every node during normal
operations, and below a few tens during maintenance such as repair.
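
For a quick look at both numbers, nodetool exposes them directly; a couple
of commands along these lines should work (the exact field names can
differ slightly between versions, and on older releases tablestats is
called cfstats):

    nodetool compactionstats | grep 'pending tasks'
    nodetool tablestats <keyspace>.<table> | grep 'SSTable count'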

> I had noticed this before, but I didn't have a good sense of what a
> "high" number for these values was.

I would say anything higher than 20 probably requires someone to have a
look, and over 1k is very troublesome.
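
If you want to turn that rule of thumb into an alert, a minimal sketch in
Python could look like the following. It assumes nodetool is on the PATH
and that compactionstats prints a "pending tasks: N" line - double-check
the output format of your version before relying on it:

    #!/usr/bin/env python
    # Warn when pending compactions exceed a threshold.
    # Assumes `nodetool compactionstats` prints "pending tasks: <N>";
    # the exact output format can vary between Cassandra versions.
    import re
    import subprocess
    import sys

    THRESHOLD = 20  # anything above this deserves a look

    def pending_compactions():
        out = subprocess.check_output(['nodetool', 'compactionstats'],
                                      universal_newlines=True)
        match = re.search(r'pending tasks:\s*(\d+)', out)
        if match is None:
            raise RuntimeError('could not parse compactionstats output')
        return int(match.group(1))

    if __name__ == '__main__':
        pending = pending_compactions()
        print('pending compactions: %d' % pending)
        sys.exit(1 if pending > THRESHOLD else 0)

You could run it from cron on each node and page on a non-zero exit code.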

> It makes sense to me why this would cause the issues I've seen. After
> increasing concurrent_compactors and compaction_throughput_mb_per_sec
> (to 8 and 64 MB/s, respectively), I'm starting to see those counts go
> down steadily. Hopefully that will resolve the OOM issues, but it looks
> like it will take a while for compactions to catch up.
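
One note on applying these: compaction_throughput_mb_per_sec can also be
changed at runtime, without a restart:

    nodetool setcompactionthroughput 64

concurrent_compactors, on the other hand, is read from cassandra.yaml at
startup, so a rolling restart is needed to pick it up (newer releases also
have nodetool setconcurrentcompactors, but check that your version ships
it):

    # cassandra.yaml -- the values from your message
    concurrent_compactors: 8
    compaction_throughput_mb_per_sec: 64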

> Thanks for the suggestions, Alex

Welcome. :-)

--
Alex
