cassandra-commits mailing list archives

From "Brandon Williams (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7307) New nodes mark dead nodes as up for 10 minutes
Date Wed, 18 Jun 2014 20:40:28 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036345#comment-14036345 ]

Brandon Williams commented on CASSANDRA-7307:
---------------------------------------------

There can be two problems here: a) replace will always refuse to replace a live node, and
b) nodes that think they need to stream from that node won't realize it's dead soon enough.
The bigger one is a). If overriding the initial value to 1s worked for you, I'm fairly certain
the problem was more a) than b), since, as we discovered earlier in the ticket, there's a maximum
60s window the old node can be alive before that check is triggered and bails out.


> New nodes mark dead nodes as up for 10 minutes
> ----------------------------------------------
>
>                 Key: CASSANDRA-7307
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7307
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Richard Low
>            Assignee: Brandon Williams
>             Fix For: 1.2.17, 2.0.9, 2.1 rc2
>
>
> When doing a node replacement while other nodes are down, we see the down nodes marked
> as up for about 10 minutes. This means requests are routed to the dead nodes, causing timeouts.
> It also means replacing a node when multiple nodes from a replica set are down is extremely
> difficult: the node usually tries to stream from a dead node and the replacement fails.
> This isn't limited to host replacement. I did a simple test:
> 1. Create a 2 node cluster
> 2. Kill node 2
> 3. Start a 3rd node with a unique token (I used auto_bootstrap=false but I don't think this is significant)
> The 3rd node lists node 2 (127.0.0.2) as up for almost 10 minutes:
> {code}
> INFO [main] 2014-05-27 14:28:24,753 CassandraDaemon.java (line 119) Logging initialized
> INFO [GossipStage:1] 2014-05-27 14:28:31,492 Gossiper.java (line 843) Node /127.0.0.2 is now part of the cluster
> INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) InetAddress /127.0.0.2 is now UP
> INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) InetAddress /127.0.0.2 is now DOWN
> {code}
> I reproduced on 1.2.15 and 1.2.16.
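
For reference, the "almost 10 minutes" figure can be checked directly against the UP and DOWN
timestamps in the quoted log. The following is only a minimal Python sketch of that arithmetic;
the helper name parse_log_time and the format string are illustrative assumptions based on the
log lines shown, not anything taken from Cassandra itself.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted log lines,
# e.g. "2014-05-27 14:28:31,495" (comma before the milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp: str) -> datetime:
    """Parse one timestamp copied from the log excerpt (hypothetical helper)."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


# Times copied from the "is now UP" and "is now DOWN" lines for /127.0.0.2.
marked_up = parse_log_time("2014-05-27 14:28:31,495")
marked_down = parse_log_time("2014-05-27 14:37:44,526")

# How long the new node treated the dead node as alive.
stale_window = marked_down - marked_up
minutes, seconds = divmod(stale_window.total_seconds(), 60)
print(f"dead node shown as UP for {int(minutes)} min {seconds:.3f} s")
```

With the two timestamps above, the window comes out at a little over 9 minutes, which matches
the reporter's description.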



--
This message was sent by Atlassian JIRA
(v6.2#6252)
