From: aaron morton <aaron@thelastpickle.com>
Subject: Re: cassandra GC cpu usage
Date: Wed, 17 Jul 2013 21:49:45 +1200
To: user@cassandra.apache.org

Dive into the logs and look for messages from the GCInspector. These log ParNew and CMS activity that takes over 200 ms. To get further insight, consider enabling the full GC logging (see cassandra-env.sh) on one of the problem nodes.

Looking at your graphs, you are getting about 2 ParNew collections a second that run around 130 ms each, so the server is pausing for about 260 ms per second to do ParNew. Which is not great.

CMS activity can also suck up CPU, especially if it's not able to drain the tenured heap.

ParNew activity is more of a measure of the throughput on the node. Can you correlate the problems with application load? Does it happen at regular intervals? Can you correlate it with repair or compaction processes?

Hope that helps

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/07/2013, at 12:14 AM, Mohit Anchlia <mohitanchlia@gmail.com> wrote:

> What's your replication factor? Can you check tpstats and netstats to see if you are getting more mutations on these nodes?
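[Editor's note: the "full GC logging" Aaron refers to is enabled by uncommenting a block of JVM flags in conf/cassandra-env.sh, which ship commented out in the 1.2-era script. A rough sketch of that block (the log path is an example and varies per install):

```shell
# In conf/cassandra-env.sh -- uncomment to get per-collection timings
# and stop-the-world pause durations in a dedicated GC log file:
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"
```

With PrintGCApplicationStoppedTime enabled you can verify Aaron's arithmetic directly: ~2 ParNew collections/sec at ~130 ms each is ~260 ms of pause per second, i.e. the node is stopped roughly a quarter of the time.]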
>
> Sent from my iPhone
>
> On Jul 16, 2013, at 3:18 PM, Jure Koren <jure.koren@zemanta.com> wrote:
>
>> Hi C* user list,
>>
>> I have a curious recurring problem with Cassandra 1.2 and what seems like a GC issue.
>>
>> The cluster looks somewhat well balanced; all nodes are running HotSpot JVM 1.6.0_31-b04 and Cassandra 1.2.3.
>>
>> Address   Rack  Status  State   Load      Owns
>> 10.2.3.6  RAC6  Up      Normal  15.13 GB  12.71%
>> 10.2.3.5  RAC5  Up      Normal  16.87 GB  13.57%
>> 10.2.3.8  RAC8  Up      Normal  13.27 GB  13.71%
>> 10.2.3.1  RAC1  Up      Normal  16.46 GB  14.08%
>> 10.2.3.7  RAC7  Up      Normal  11.59 GB  14.34%
>> 10.2.3.2  RAC2  Up      Normal  23.15 GB  15.12%
>> 10.2.3.4  RAC4  Up      Normal  16.52 GB  16.47%
>>
>> Every now and then (roughly once a month, currently), two nodes (always the same two) need to be restarted after they start eating all available CPU cycles, and read and write latencies increase dramatically. A restart fixes this every time.
>>
>> The only metric that significantly deviates from the average for all nodes shows GC doing something: http://bou.si/rest/parnew.png
>>
>> Is there a way to debug this? After searching online, it appears nobody has really solved this problem, and I have no idea what could cause such behaviour in just two particular cluster nodes.
>>
>> I'm now thinking of decommissioning the problematic nodes and bootstrapping them anew, but can't decide if this could possibly help.
>>
>> Thanks in advance for any insight anyone might offer,
>>
>> --
>> Jure Koren, DevOps
>> http://www.zemanta.com/
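[Editor's note: the checks suggested in this thread are all nodetool subcommands available in Cassandra 1.2. A rough session on one of the suspect nodes might look like the following; the host address is taken from the ring listing above, and the log path is an example:

```shell
# Thread-pool stats: pending/blocked stages and dropped mutations
# hint at whether these two nodes receive disproportionate writes.
nodetool -h 10.2.3.2 tpstats

# Streaming activity and pending commands per peer node.
nodetool -h 10.2.3.2 netstats

# Long-running compactions that could correlate with the GC spikes.
nodetool -h 10.2.3.2 compactionstats

# The GCInspector lines Aaron mentions report collections taking
# over ~200 ms; grep them out of the Cassandra system log.
grep GCInspector /var/log/cassandra/system.log | tail
```

Comparing tpstats output between a problem node and a healthy one is the quickest way to test Mohit's hypothesis that the two nodes are getting more mutations than their peers.]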