Date: Thu, 2 Jul 2015 20:44:05 +0000 (UTC)
From: "Alexander Piavlo (JIRA)" 
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-9720) half open tcp connections to cassandra cluster nodes cause 100% cpu load
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

[ https://issues.apache.org/jira/browse/CASSANDRA-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Piavlo updated CASSANDRA-9720:
----------------------------------------

Description:

cassandra 2.1.5

We noticed that a few of the nodes in our cluster hit a sudden 100% CPU spike that never subsided. It was not GC, nor increased read/write load on those nodes. What we saw is that the nodes at 100% CPU all had some connections (file descriptors) in the "can't identify protocol" state, which indicates abruptly closed connections that were improperly handled by the Cassandra process.
http://stackoverflow.com/questions/7911840/seeing-too-many-lsof-cant-identify-protocol

We are fairly sure the trigger is the Spark Cassandra connector, which suddenly started getting stuck in early discovery of the Cassandra nodes, before running any stages. We had to restart the affected Cassandra processes to bring CPU usage back to normal.

ps. We had similar issues some time ago with an earlier version of the Cassandra 2.1.x branch, and ended up solving the problem by upgrading from Spark 1.2.1 to Spark 1.3.1, together with the corresponding upgrade of the Spark DataStax connector. Now the problem appears to be back with 99.9% the same symptoms.

ps2. We have previously observed several processes unrelated to Java/Cassandra (mainly php-cli) go crazy on CPU while showing the same "can't identify protocol" symptoms.


> half open tcp connections to cassandra cluster nodes cause 100% cpu load
> ------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9720
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9720
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Alexander Piavlo
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
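For readers hitting the same symptom, the check described in the report can be sketched as below. This is a minimal diagnostic sketch, not from the original report: it assumes `lsof` and `pgrep` are available, and that the Cassandra JVM is identifiable by the main class name `CassandraDaemon` (adjust for your packaging).

```shell
#!/bin/sh
# Find the Cassandra process (assumption: JVM main class is CassandraDaemon).
pid=$(pgrep -f CassandraDaemon | head -n 1)

# Count file descriptors stuck in the "can't identify protocol" state.
# lsof prints this string in the NAME column for orphaned sockets; a
# steadily growing count on a node pinned at 100% CPU matches the
# symptom described in this ticket.
lsof -p "$pid" 2>/dev/null | grep -c "can't identify protocol"
```

A nonzero, growing count suggests leaked sockets from abruptly dropped client connections; restarting the affected Cassandra process was the reporter's only remedy.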