From: Keith Wright <kwright@nanigans.com>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wed, 21 Aug 2013 01:35:20 -0500
Subject: Re: Nodes get stuck

Still looking for help! We have stopped almost ALL traffic to the cluster and still some nodes are showing almost 1000% CPU for Cassandra with no iostat activity. We were running cleanup on one of the nodes that was not showing load spikes; however, now when I attempt to stop cleanup there via nodetool stop cleanup, the Java task for stopping cleanup itself is at 1500% and has not returned after 2 minutes. This is VERY odd behavior. Any ideas? Hardware failure? Network? We are not seeing anything there but wanted to get ideas.
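
For context, this is roughly the kind of checking I have in mind on the pegged node (just a sketch; <cassandra-pid> is a placeholder, and the jstat/jstack steps are generic JVM checks rather than anything Cassandra-specific, only there to tell GC time apart from a wedged compaction thread):

    # Is the cleanup/compaction actually still running?
    nodetool -h localhost compactionstats

    # Ask Cassandra to abort the cleanup (this is the stop that never returns for us)
    nodetool -h localhost stop CLEANUP

    # Is the CPU going to GC or to application threads?
    jstat -gcutil <cassandra-pid> 1000 10      # GC utilization, one sample per second

    # Which threads are hot, and what are they doing?
    top -H -p <cassandra-pid>                  # per-thread CPU
    jstack <cassandra-pid> > /tmp/threads.txt  # match hot thread IDs (hex nid) to stacks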

Thanks

From: Keith Wright <kwright@nanigans.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, August 20, 2013 8:32 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Nodes get stuck

Hi all,

    We are using C* 1.2.4 with vnodes and SSDs. We have seen behavior recently where 3 of our nodes get locked up under high load in what appears to be a GC spiral while the rest of the cluster (7 nodes total) appears fine. When I run tpstats, I see the following (assuming tpstats returns at all), and top shows Cassandra pegged at 2000%. Obviously we have a large number of blocked reads. In the past I could explain this by unexpectedly wide rows, but we have handled that. When the cluster starts to melt down like this, it's hard to get visibility into what's going on and what triggered the issue, as everything starts to pile on. OpsCenter becomes unusable, and because the affected nodes are under GC pressure, getting any data via nodetool or JMX is also difficult. What do people do to handle these situations? We are going to start graphing reads/writes/sec per CF to Ganglia in the hopes that it helps.
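
For the Ganglia piece, something along these lines is what I have in mind (only a sketch: "users" is a placeholder CF name, it assumes gmetric is installed, and the awk against the stock cfstats output is approximate; rates would be derived on the graphing side):

    #!/bin/sh
    # Sketch: push raw per-CF read/write counts into Ganglia as metrics.
    CF=users

    # Pull the Read/Write counters for the first matching column family.
    eval "$(nodetool cfstats | awk -v cf="$CF" '
        $0 ~ "Column Family: " cf {found=1}
        found && /Read Count:/    {print "reads="  $3}
        found && /Write Count:/   {print "writes=" $3; exit}')"

    # Emit the counters; Ganglia graphs can turn them into rates.
    gmetric --name "cassandra_${CF}_reads"  --value "$reads"  --type double
    gmetric --name "cassandra_${CF}_writes" --value "$writes" --type double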

Thanks

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
RequestResponseStage              0         0     1161495947         0                 0
MutationStage                     8         8      481721887         0                 0
ReadRepairStage                   0         0       85770600         0                 0
ReplicateOnWriteStage             0         0       21896804         0                 0
GossipStage                       0         0        1546196         0                 0
AntiEntropyStage                  0         0           5009         0                 0
MigrationStage                    0         0           1082         0                 0
MemtablePostFlusher               0         0          10178         0                 0
FlushWriter                       0         0           6081         0              2075
MiscStage                         0         0             57         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
HintedHandoff                     1         1            246         0                 0

Message type           Dropped
RANGE_SLICE                482
READ_REPAIR                  0
BINARY                       0
READ                    515762
MUTATION                    39
_TRACE                       0
REQUEST_RESPONSE            29
