Return-Path:
X-Original-To: apmail-cassandra-commits-archive@www.apache.org
Delivered-To: apmail-cassandra-commits-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 7901810974 for ; Tue, 30 Dec 2014 19:37:13 +0000 (UTC)
Received: (qmail 57678 invoked by uid 500); 30 Dec 2014 19:37:13 -0000
Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org
Received: (qmail 57635 invoked by uid 500); 30 Dec 2014 19:37:13 -0000
Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@cassandra.apache.org
Delivered-To: mailing list commits@cassandra.apache.org
Received: (qmail 57623 invoked by uid 99); 30 Dec 2014 19:37:13 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 30 Dec 2014 19:37:13 +0000
Date: Tue, 30 Dec 2014 19:37:13 +0000 (UTC)
From: "Donald Smith (JIRA)"
To: commits@cassandra.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Comment Edited] (CASSANDRA-8245) Cassandra nodes periodically die in 2-DC configuration
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/CASSANDRA-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261238#comment-14261238 ]

Donald Smith edited comment on CASSANDRA-8245 at 12/30/14 7:36 PM:
-------------------------------------------------------------------

We're getting a similar increase in the number of pending Gossip stage tasks, followed by OutOfMemory. This happens once a day or so on some node of our 38 node DC. Other nodes have increases in pending Gossip stage tasks but they recover. This is with C* 2.0.11. We have two other DCs. ntpd is running on all nodes.
But all nodes on one DC are down now. What's odd is that the cassandra process continues running despite the OutOfMemory exception. You'd expect it to exit. Prior to getting OutOfMemory, I notice that such nodes are slow in responding to commands and queries (e.g., jmx).

{noformat}
WARN [GossipTasks:1] 2014-12-26 02:45:06,204 Gossiper.java (line 648) Gossip stage has 2695 pending tasks; skipping status check (no nodes will be marked down)
ERROR [Thread-49234] 2014-12-26 07:18:42,281 CassandraDaemon.java (line 199) Exception in thread Thread[Thread-49234,5,main]
java.lang.OutOfMemoryError: Java heap space
....
ERROR [Thread-49235] 2014-12-26 07:18:42,291 CassandraDaemon.java (line 199) Exception in thread Thread[Thread-49235,5,main]
java.lang.OutOfMemoryError: Java heap space
...
{noformat}


was (Author: thinkerfeeler):
We're getting a similar increase in the number of pending Gossip stage tasks, followed by OutOfMemory. This happens once a day or so on some node of our 38 node DC. Other nodes have increases in pending Gossip stage tasks but they recover. This is with C* 2.0.11. We have two other DCs. ntpd is running on all nodes.
But all nodes on one DC are down now. What's odd is that the cassandra process continues running despite the OutOfMemory exception. You'd expect it to exit.

{noformat}
WARN [GossipTasks:1] 2014-12-26 02:45:06,204 Gossiper.java (line 648) Gossip stage has 2695 pending tasks; skipping status check (no nodes will be marked down)
ERROR [Thread-49234] 2014-12-26 07:18:42,281 CassandraDaemon.java (line 199) Exception in thread Thread[Thread-49234,5,main]
java.lang.OutOfMemoryError: Java heap space
....
ERROR [Thread-49235] 2014-12-26 07:18:42,291 CassandraDaemon.java (line 199) Exception in thread Thread[Thread-49235,5,main]
java.lang.OutOfMemoryError: Java heap space
...
{noformat}

> Cassandra nodes periodically die in 2-DC configuration
> ------------------------------------------------------
>
> Key: CASSANDRA-8245
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8245
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Scientific Linux release 6.5
> java version "1.7.0_51"
> Cassandra 2.0.9
> Reporter: Oleg Poleshuk
> Assignee: Brandon Williams
> Priority: Minor
> Attachments: stack1.txt, stack2.txt, stack3.txt, stack4.txt, stack5.txt
>
>
> We have 2 DCs with 3 nodes in each.
> Second DC periodically has 1-2 nodes down.
> Looks like it loses connectivity with other nodes and then Gossiper starts to accumulate tasks until Cassandra dies with OOM.
> WARN [MemoryMeter:1] 2014-08-12 14:34:59,803 Memtable.java (line 470) setting live ratio to maximum of 64.0 instead of Infinity
> WARN [GossipTasks:1] 2014-08-12 14:44:34,866 Gossiper.java (line 637) Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down)
> WARN [GossipTasks:1] 2014-08-12 14:44:35,968 Gossiper.java (line 637) Gossip stage has 4 pending tasks; skipping status check (no nodes will be marked down)
> WARN [GossipTasks:1] 2014-08-12 14:44:37,070 Gossiper.java (line 637) Gossip stage has 8 pending tasks; skipping status check (no nodes will be marked down)
> WARN [GossipTasks:1] 2014-08-12 14:44:38,171 Gossiper.java (line 637) Gossip stage has 11 pending tasks; skipping status check (no nodes will be marked down)
> ...
> WARN [GossipTasks:1] 2014-10-06 21:42:51,575 Gossiper.java (line 637) Gossip stage has 1014764 pending tasks; skipping status check (no nodes will be marked down)
> WARN [New I/O worker #13] 2014-10-06 21:54:27,010 Slf4JLogger.java (line 76) Unexpected exception in the selector loop.
> java.lang.OutOfMemoryError: Java heap space
> Also these lines, but not sure they are relevant:
> DEBUG [GossipStage:1] 2014-08-12 11:33:18,801 FailureDetector.java (line 338) Ignoring interval time of 2085963047

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)