From: "Richard Low (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Tue, 27 May 2014 23:10:02 +0000 (UTC)
Subject: [jira] [Commented] (CASSANDRA-7307) New nodes mark dead nodes as up for 10 minutes

    [ https://issues.apache.org/jira/browse/CASSANDRA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010478#comment-14010478 ]

Richard Low commented on CASSANDRA-7307:
----------------------------------------

The 'Cannnot (sic) replace a live node' error came about 1 minute after boot, even with a 5 minute RING_DELAY.
So I don't think a higher RING_DELAY will work:

INFO [main] 2014-05-23 19:51:16,934 CassandraDaemon.java (line 119) Logging initialized
INFO [main] 2014-05-23 19:51:20,038 StorageService.java (line 105) Overriding RING_DELAY to 300000ms
ERROR [main] 2014-05-23 19:52:25,189 CassandraDaemon.java (line 464) Exception encountered during startup
java.lang.UnsupportedOperationException: Cannnot replace a live node...

I was surprised by this; I expected it to wait for RING_DELAY before getting host replacement info. Is this expected behaviour?

> New nodes mark dead nodes as up for 10 minutes
> ----------------------------------------------
>
> Key: CASSANDRA-7307
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7307
> Project: Cassandra
> Issue Type: Bug
> Reporter: Richard Low
> Assignee: Brandon Williams
> Fix For: 1.2.17
>
>
> When doing a node replacement while other nodes are down, we see the down nodes marked as up for about 10 minutes. This means requests are routed to the dead nodes, causing timeouts. It also means that replacing a node when multiple nodes from a replica set are down is extremely difficult: the node usually tries to stream from a dead node and the replacement fails.
> This isn't limited to host replacement. I did a simple test:
> 1. Create a 2 node cluster
> 2. Kill node 2
> 3. Start a 3rd node with a unique token (I used auto_bootstrap=false but I don't think this is significant)
> The 3rd node lists node 2 (127.0.0.2) as up for almost 10 minutes:
> {code}
> INFO [main] 2014-05-27 14:28:24,753 CassandraDaemon.java (line 119) Logging initialized
> INFO [GossipStage:1] 2014-05-27 14:28:31,492 Gossiper.java (line 843) Node /127.0.0.2 is now part of the cluster
> INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) InetAddress /127.0.0.2 is now UP
> INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) InetAddress /127.0.0.2 is now DOWN
> {code}
> I reproduced on 1.2.15 and 1.2.16.
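For anyone checking the timings quoted above, here is a minimal sketch that computes the gaps from the log timestamps. It is not part of Cassandra; the helper names (parse_log_time, seconds_between) are made up for illustration, and it assumes every line follows the "LEVEL [thread] DATE TIME,ms ..." layout seen in the excerpts.

```python
from datetime import datetime

# Timestamp layout assumed from the log excerpts, e.g. "2014-05-27 14:28:31,495"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line):
    """Return the timestamp of a 'LEVEL [thread] DATE TIME,ms ...' log line."""
    # Drop everything up to and including the "[thread] " prefix.
    after_thread = line.split("] ", 1)[1]
    # The next two whitespace-separated fields are the date and the time.
    date_part, time_part = after_thread.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", LOG_TIME_FORMAT)


def seconds_between(first_line, second_line):
    """Return the elapsed seconds from first_line to second_line."""
    delta = parse_log_time(second_line) - parse_log_time(first_line)
    return delta.total_seconds()


# Lines copied from the issue description: node 2 marked UP, then DOWN.
up_line = "INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) InetAddress /127.0.0.2 is now UP"
down_line = "INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) InetAddress /127.0.0.2 is now DOWN"

# Lines copied from the comment: boot, then the startup failure.
boot_line = "INFO [main] 2014-05-23 19:51:16,934 CassandraDaemon.java (line 119) Logging initialized"
error_line = "ERROR [main] 2014-05-23 19:52:25,189 CassandraDaemon.java (line 464) Exception encountered during startup"

dead_node_up_seconds = seconds_between(up_line, down_line)
boot_to_error_seconds = seconds_between(boot_line, error_line)

print(f"Dead node listed as UP for {dead_node_up_seconds:.3f} s")
print(f"Startup error came {boot_to_error_seconds:.3f} s after boot")
```

On these lines the dead node is listed as UP for a little over 9 minutes, and the startup error arrives roughly 68 seconds after boot, well short of the 300000 ms RING_DELAY override.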
-- This message was sent by Atlassian JIRA (v6.2#6252)