Date: Thu, 26 Feb 2015 19:45:06 +0000 (UTC)
From: "Brandon Williams (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

    [ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339022#comment-14339022 ]

Brandon Williams commented on CASSANDRA-8336:
---------------------------------------------

bq. If hit 'Unable to gossip with any seeds’ on replace, it shuts down the gossiper.

Do you have the stacktrace where this is happening? I have a feeling we're going to end up in checked exception hell trying to fix this since we throw RuntimeException there (to avoid such a hell, in fact.)

> Quarantine nodes after receiving the gossip shutdown message
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-8336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.13
>
>         Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt
>
>
> In CASSANDRA-3936 we added a gossip shutdown announcement. The problem here is that this isn't sufficient; you can still get TOEs and have to wait on the FD to figure things out. This happens due to gossip propagation time and variance; if node X shuts down and sends the message to Y, but Z has a greater gossip version than Y for X and has not yet received the message, it can initiate gossip with Y and thus mark X alive again. I propose quarantining to solve this, however I feel it should be a -D parameter you have to specify, so as not to destroy current dev and test practices, since this will mean a node that shuts down will not be able to restart until the quarantine expires.
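
The race described in the issue can be illustrated with a small toy model. This is only a sketch in Python, not Cassandra's Gossiper code: the Node and EndpointView classes, the quarantine_ms setting, and the rule that applying a newer gossip version marks a peer alive are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointView:
    """What one node believes about a peer (hypothetical, simplified)."""
    version: int
    alive: bool = True


@dataclass
class Node:
    name: str
    views: dict = field(default_factory=dict)              # peer -> EndpointView
    quarantined_until: dict = field(default_factory=dict)  # peer -> timestamp (ms)
    quarantine_ms: int = 0  # hypothetical knob; 0 disables quarantining

    def on_shutdown_message(self, peer, now):
        # The peer announced its shutdown: mark it down immediately.
        self.views[peer].alive = False
        if self.quarantine_ms > 0:
            self.quarantined_until[peer] = now + self.quarantine_ms

    def on_gossip_from(self, other, now):
        for peer, remote in other.views.items():
            if peer == self.name:
                continue
            # While a peer is quarantined, ignore gossip about it.
            if now < self.quarantined_until.get(peer, 0):
                continue
            local = self.views.get(peer)
            # Assumed simplification: newer state marks the peer alive again.
            if local is None or remote.version > local.version:
                self.views[peer] = EndpointView(remote.version, alive=True)


def x_alive_on_y_after_race(quarantine_ms):
    # Y knows X at version 5; Z knows X at version 7 and missed the shutdown.
    y = Node("Y", views={"X": EndpointView(version=5)}, quarantine_ms=quarantine_ms)
    z = Node("Z", views={"X": EndpointView(version=7)})
    y.on_shutdown_message("X", now=0)  # X shuts down and tells Y
    y.on_gossip_from(z, now=100)       # Z then gossips its newer view of X to Y
    return y.views["X"].alive


print(x_alive_on_y_after_race(0))       # True: X is wrongly marked alive again
print(x_alive_on_y_after_race(30_000))  # False: the quarantine keeps X down
```

In this model the quarantine also blocks a legitimate restart of X until it expires, which is the trade-off the issue raises when it suggests making the behavior opt-in.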