From: aaron morton <aaron@thelastpickle.com>
Subject: Re: cassandra GC cpu usage
Date: Wed, 17 Jul 2013 21:49:45 +1200
To: user@cassandra.apache.org

Dive into the logs and look for messages from the GCInspector. These log ParNew and CMS activity that takes over 200 ms. To get further insight, consider enabling the full GC logging (see cassandra-env.sh) on one of the problem nodes.

Looking at your graphs, you are getting about 2 ParNew collections a second that run around 130 ms each, so the server is pausing for about 260 ms per second to do ParNew. Which is not great.

CMS activity can also suck up CPU, especially if it's not able to drain the tenured heap.

ParNew activity is more of a measure of the throughput on the node. Can you correlate the problems with application load? Does it happen at regular intervals? Can you correlate it with repair or compaction processes?

Hope that helps

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/07/2013, at 12:14 AM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

> What's your replication factor? Can you check tpstats and netstats to see if you are getting more mutations on these nodes?
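[Editor's note: the "full GC logging" Aaron refers to is enabled by uncommenting a block of JVM flags in conf/cassandra-env.sh, which ship commented out in the 1.2-era script. A rough sketch of that block (the log path is an example and varies per install):

```shell
# In conf/cassandra-env.sh -- uncomment to get per-collection timings
# and stop-the-world pause durations in a dedicated GC log file:
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"
```

With PrintGCApplicationStoppedTime enabled you can verify Aaron's arithmetic directly: ~2 ParNew collections/sec at ~130 ms each is ~260 ms of pause per second, i.e. the node is stopped roughly a quarter of the time.]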
>
> Sent from my iPhone
>
> On Jul 16, 2013, at 3:18 PM, Jure Koren <jure.koren@zemanta.com> wrote:
>
>> Hi C* user list,
>>
>> I have a curious recurring problem with Cassandra 1.2 and what seems like a GC issue.
>>
>> The cluster looks somewhat well balanced; all nodes are running HotSpot JVM 1.6.0_31-b04 and Cassandra 1.2.3.
>>
>> Address   Rack  Status  State   Load      Owns
>> 10.2.3.6  RAC6  Up      Normal  15.13 GB  12.71%
>> 10.2.3.5  RAC5  Up      Normal  16.87 GB  13.57%
>> 10.2.3.8  RAC8  Up      Normal  13.27 GB  13.71%
>> 10.2.3.1  RAC1  Up      Normal  16.46 GB  14.08%
>> 10.2.3.7  RAC7  Up      Normal  11.59 GB  14.34%
>> 10.2.3.2  RAC2  Up      Normal  23.15 GB  15.12%
>> 10.2.3.4  RAC4  Up      Normal  16.52 GB  16.47%
>>
>> Every now and then (roughly once a month, currently), two nodes (always the same two) need to be restarted after they start eating all available CPU cycles, and read and write latencies increase dramatically. A restart fixes this every time.
>>
>> The only metric that significantly deviates from the average for all nodes shows GC doing something: http://bou.si/rest/parnew.png
>>
>> Is there a way to debug this? After searching online, it appears nobody has really solved this problem, and I have no idea what could cause such behaviour in just two particular cluster nodes.
>>
>> I'm now thinking of decommissioning the problematic nodes and bootstrapping them anew, but can't decide if this could possibly help.
>>
>> Thanks in advance for any insight anyone might offer,
>>
>> --
>> Jure Koren, DevOps
>> http://www.zemanta.com/
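[Editor's note: the checks suggested in this thread are all nodetool subcommands available in Cassandra 1.2. A rough session on one of the suspect nodes might look like the following; the host address is taken from the ring listing above, and the log path is an example:

```shell
# Thread-pool stats: pending/blocked stages and dropped mutations
# hint at whether these two nodes receive disproportionate writes.
nodetool -h 10.2.3.2 tpstats

# Streaming activity and pending commands per peer node.
nodetool -h 10.2.3.2 netstats

# Long-running compactions that could correlate with the GC spikes.
nodetool -h 10.2.3.2 compactionstats

# The GCInspector lines Aaron mentions report collections taking
# over ~200 ms; grep them out of the Cassandra system log.
grep GCInspector /var/log/cassandra/system.log | tail
```

Comparing tpstats output between a problem node and a healthy one is the quickest way to test Mohit's hypothesis that the two nodes are getting more mutations than their peers.]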