From: Keith Wright <kwright@nanigans.com>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wed, 21 Aug 2013 01:35:20 -0500
Subject: Re: Nodes get stuck

Still looking for help! We have stopped almost ALL traffic to the cluster and still some nodes are showing almost 1000% CPU for Cassandra with no iostat activity. We were running cleanup on one of the nodes that was not showing load spikes; however, now when I attempt to stop cleanup there via nodetool stop cleanup, the Java task for stopping cleanup itself is at 1500% and has not returned after 2 minutes. This is VERY odd behavior. Any ideas? Hardware failure? Network? We are not seeing anything there but wanted to get ideas.
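
For context, this is roughly the kind of checking I have in mind on the pegged node (just a sketch; <cassandra-pid> is a placeholder, and the jstat/jstack steps are generic JVM checks rather than anything Cassandra-specific, only there to tell GC time apart from a wedged compaction thread):

    # Is the cleanup/compaction actually still running?
    nodetool -h localhost compactionstats

    # Ask Cassandra to abort the cleanup (this is the stop that never returns for us)
    nodetool -h localhost stop CLEANUP

    # Is the CPU going to GC or to application threads?
    jstat -gcutil <cassandra-pid> 1000 10      # GC utilization, one sample per second

    # Which threads are hot, and what are they doing?
    top -H -p <cassandra-pid>                  # per-thread CPU
    jstack <cassandra-pid> > /tmp/threads.txt  # match hot thread IDs (hex nid) to stacks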

Thanks

From: Keith Wright <kwright@nanigans.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, August 20, 2013 8:32 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Nodes get stuck

Hi all,

    We are using C* 1.2.4 with vnodes and SSDs. We have seen behavior recently where 3 of our nodes get locked up under high load in what appears to be a GC spiral while the rest of the cluster (7 nodes total) appears fine. When I run tpstats, I see the following (assuming tpstats returns at all), and top shows Cassandra pegged at 2000%. Obviously we have a large number of blocked reads. In the past I could explain this by unexpectedly wide rows, but we have handled that. When the cluster starts to melt down like this, it's hard to get visibility into what's going on and what triggered the issue, as everything starts to pile on. OpsCenter becomes unusable, and because the affected nodes are under GC pressure, getting any data via nodetool or JMX is also difficult. What do people do to handle these situations? We are going to start graphing reads/writes/sec per CF to Ganglia in the hopes that it helps.
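
For the Ganglia piece, something along these lines is what I have in mind (only a sketch: "users" is a placeholder CF name, it assumes gmetric is installed, and the awk against the stock cfstats output is approximate; rates would be derived on the graphing side):

    #!/bin/sh
    # Sketch: push raw per-CF read/write counts into Ganglia as metrics.
    CF=users

    # Pull the Read/Write counters for the first matching column family.
    eval "$(nodetool cfstats | awk -v cf="$CF" '
        $0 ~ "Column Family: " cf {found=1}
        found && /Read Count:/    {print "reads="  $3}
        found && /Write Count:/   {print "writes=" $3; exit}')"

    # Emit the counters; Ganglia graphs can turn them into rates.
    gmetric --name "cassandra_${CF}_reads"  --value "$reads"  --type double
    gmetric --name "cassandra_${CF}_writes" --value "$writes" --type double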

Thanks

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                       256       381     1245117434         0                 0
RequestResponseStage              0         0     1161495947         0                 0
MutationStage                     8         8      481721887         0                 0
ReadRepairStage                   0         0       85770600         0                 0
ReplicateOnWriteStage             0         0       21896804         0                 0
GossipStage                       0         0        1546196         0                 0
AntiEntropyStage                  0         0           5009         0                 0
MigrationStage                    0         0           1082         0                 0
MemtablePostFlusher               0         0          10178         0                 0
FlushWriter                       0         0           6081         0              2075
MiscStage                         0         0             57         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              6         0                 0
HintedHandoff                     1         1            246         0                 0

Message type           Dropped
RANGE_SLICE                482
READ_REPAIR                  0
BINARY                       0
READ                    515762
MUTATION                    39
_TRACE                       0
REQUEST_RESPONSE            29
