Subject: Re: restarting node makes cpu load of the entire cluster to raise
From: Jonathan Lacefield <jlacefield@datastax.com>
To: user@cassandra.apache.org
Date: Wed, 18 Jun 2014 08:23:31 -0400

There are several long ParNew pauses recorded during startup. The young gen size looks large too, if I am reading that line correctly. Did you happen to override the default settings for MAX_HEAP_SIZE and/or HEAP_NEWSIZE in cassandra-env.sh? A large young gen size, set via that file, could be causing longer-than-typical pauses, which could make the node appear unresponsive and show high CPU (the CPU being spent in the ParNew GC events).

Check out this one:

    INFO 11:42:51,939 GC for ParNew: 2148 ms for 2 collections, 1256307568 used; max is 8422162432

That is a 2-second GC pause, which is very high for ParNew. We typically want lots of tiny ParNew events rather than large, less frequent ones.
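For reference, a minimal sketch of what such overrides look like in cassandra-env.sh; the values below are illustrative assumptions, not a guess at your actual settings:

    # cassandra-env.sh -- illustrative values only; check what your file really sets
    MAX_HEAP_SIZE="8G"     # would roughly match the "max is 8422162432" in the log line above
    HEAP_NEWSIZE="800M"    # a much larger HEAP_NEWSIZE lengthens each ParNew pause;
                           # the usual rule of thumb is about 100 MB per physical CPU core

If either variable is uncommented and set well above the defaults, that alone can explain multi-second ParNew pauses.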
One other thing I noticed is that the node replayed a lot of commit log segments during startup. You could avoid these, or minimize them, by performing a flush or drain before stopping and starting Cassandra; that flushes the memtables and clears the log segments (a rough sketch of the commands follows the quoted thread below).

Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487

On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
> A simple restart of a node with no changes gives this result.
>
> Logs output: https://gist.github.com/arodrime/db9ab152071d1ad39f26
>
> Here are some screenshots:
>
> - htop from a node immediately after restarting
> - OpsCenter ring view (shows CPU load on all nodes)
> - OpsCenter dashboard showing the impact of a restart on latency (it can
>   affect writes or reads; it depends, and the reaction seems quite random)
>
>
> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jlacefield@datastax.com>:
>
>> Hello,
>>
>> Have you checked the log file to see what's happening during startup?
>> What caused the rolling restart? Did you perform an upgrade or change
>> a config?
>>
>>> On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
>>>
>>> Hi guys,
>>>
>>> Using 1.2.11, when I try to rolling-restart the cluster, any node I
>>> restart makes the whole cluster's CPU load increase, reaching a "red"
>>> state in OpsCenter (load going from 3-4 to 20+). This happens once the
>>> node is back online.
>>>
>>> The restarted node uses 100% CPU for 5-10 minutes and sometimes drops
>>> mutations.
>>>
>>> I have tried throttling handoff to 256 (instead of 1024), yet it
>>> doesn't seem to help much.
>>>
>>> Disks are not the bottleneck. ParNew GC increases a bit, but nothing
>>> problematic, I think.
>>>
>>> Basically, what could be happening on node restart? What is taking that
>>> much CPU on every machine? There is no steal or iowait.
>>>
>>> What can I try to tune?
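As mentioned above, a rough sketch of the pre-restart sequence; the host below is a placeholder:

    # run against the node you are about to restart (host is a placeholder)
    nodetool -h 127.0.0.1 flush   # write memtables out to SSTables
    nodetool -h 127.0.0.1 drain   # flush again and stop accepting writes; the commit
                                  # log is then empty, so startup skips segment replay
    # now stop and restart the Cassandra process on that node

Note that drain by itself already flushes, so the explicit flush is only there to mirror the "flush or drain" wording above.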