Date: Tue, 20 Jan 2015 21:27:36 +0000 (UTC)
From: "Brandon Williams (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

     [ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-8336:
----------------------------------------
    Attachment: 8336.txt

Patch to take the hibernate approach, but use a new STATUS VV called SHUTDOWN. One reason we need this (not just for operator clarity) is that we have to special case it for Gossiper.convict, since if it receives our shutdown state passively over gossip before it receives it actively over RPC it will be in a dead state and left untouched.
We can also race the other way (RPC before passive), but the isAlive check prevents us from doubly marking it down. In a mixed cluster, this still preserves the old behavior, since older nodes won't know what 'shutdown' means and will just rely on the active RPC method to mark the node down.

> Quarantine nodes after receiving the gossip shutdown message
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-8336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.13
>
>         Attachments: 8336.txt
>
>
> In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TimedOutExceptions (TOEs) and have to wait on the failure detector (FD) to figure things out. This happens due to gossip propagation time and variance: if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, Z can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this; however, I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
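
The race in the issue description, and the effect of carrying the shutdown as versioned gossip state, can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Cassandra's implementation: the View class, the function names, and the version numbers 10, 12 and 13 are all hypothetical, and the model assumes the SHUTDOWN status is published with a version higher than any state other nodes already hold for X.

```python
from dataclasses import dataclass


@dataclass
class View:
    """One node's view of endpoint X (toy model, not Cassandra's EndpointState)."""
    version: int
    status: str = "NORMAL"
    alive: bool = True


def receive_shutdown_rpc(view: View) -> None:
    """Active announcement: flips liveness only, no versioned state changes."""
    if view.alive:  # isAlive-style guard: never mark down twice
        view.alive = False


def merge_gossip(local: View, remote: View) -> None:
    """Passive gossip: strictly newer state wins; liveness follows its status."""
    if remote.version <= local.version:
        return
    local.version = remote.version
    local.status = remote.status
    local.alive = remote.status != "SHUTDOWN"


def announcement_only() -> bool:
    """X announces shutdown to Y only; Z still holds a newer version for X."""
    y, z = View(version=10), View(version=12)
    receive_shutdown_rpc(y)   # Y marks X down
    merge_gossip(y, z)        # Z's newer state marks X alive again on Y
    return y.alive


def with_shutdown_status() -> bool:
    """X also publishes SHUTDOWN as versioned state (assumed version 13)."""
    y, z = View(version=10), View(version=12)
    shutdown = View(version=13, status="SHUTDOWN", alive=False)
    merge_gossip(y, shutdown)  # passive path arrives first
    receive_shutdown_rpc(y)    # active path is a no-op: already down
    merge_gossip(y, z)         # Z's state is older, so it is ignored
    return y.alive


print(announcement_only())     # True: X was resurrected on Y
print(with_shutdown_status())  # False: X stays down on Y
```

In this model the announcement alone loses to any peer holding a newer version for X, while a versioned SHUTDOWN status outranks that stale state, and the liveness guard makes the order of the passive and active paths irrelevant.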