Return-Path: X-Original-To: apmail-cassandra-commits-archive@www.apache.org Delivered-To: apmail-cassandra-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id DBFEB1006B for ; Sat, 22 Jun 2013 19:16:20 +0000 (UTC) Received: (qmail 66285 invoked by uid 500); 22 Jun 2013 19:16:20 -0000 Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org Received: (qmail 66263 invoked by uid 500); 22 Jun 2013 19:16:20 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 66252 invoked by uid 99); 22 Jun 2013 19:16:20 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 22 Jun 2013 19:16:20 +0000 Date: Sat, 22 Jun 2013 19:16:20 +0000 (UTC) From: "Dave Brosius (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CASSANDRA-5665) Gossiper.handleMajorStateChange can lose existing node ApplicationState MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/CASSANDRA-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691224#comment-13691224 ] Dave Brosius commented on CASSANDRA-5665: ----------------------------------------- Map merged = new HashMap(); in Gossiper.copyNewerApplicationStates could be an EnumMap > Gossiper.handleMajorStateChange can lose existing node ApplicationState > ----------------------------------------------------------------------- > > Key: CASSANDRA-5665 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5665 > Project: Cassandra > Issue Type: Bug > Components: Core > Affects Versions: 1.2.5 > Reporter: Jason Brown > Assignee: Jason Brown > Priority: Minor > Labels: gossip, upgrade > Fix For: 1.2.6, 2.0 beta 1 > > Attachments: 5665-v1.diff, 5665-v2.diff > > > Dovetailing on CASSANDRA-5660, I discovered that further along during an upgrade, when more nodes are on the new major version, a node the previous version can get passed some incomplete Gossip info about another, already upgraded node, and the older node drops AppStat info about that node. > I think what happens is that a 1.1 node (older rev) gets gossip info from a 1.2 node (A), which includes incomplete (lacking some AppState data) gossip info about another 1.2 node (B). The 1.1 node, which has marked incorrectly kicked node B out of gossip due to the bug described in #5660, then takes that incomplete node B info and wholesale replaces any previous known state about node B in Gossiper.handleMajorStateChanged. Thus, if we previously had DC/RACK info, it'll get dropped as part of the endpointStateMap.put(endpointstate). When the data being pased is incomplete, 1.1 will start referencing node B and gets into the NPE situation in #5498. > Anecdotally, this bad state is short-lived, less than a few minutes, even as short as ten seconds, until gossip catches up and properly propagates the AppState data. Furthermore, when upgrading a two datacenter, 48 node cluster, it only occurred on two nodes for less than a minute each. Thus, the scope seems limited but can occur. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira