cassandra-commits mailing list archives

From "Jason Brown (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-5665) Gossiper.handleMajorStateChange can lose existing node ApplicationState
Date Wed, 19 Jun 2013 19:13:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688331#comment-13688331
] 

Jason Brown edited comment on CASSANDRA-5665 at 6/19/13 7:12 PM:
-----------------------------------------------------------------

I suspect it's just something we haven't noticed before. I found it when upgrading from 1.0
to 1.1, largely via the CASSANDRA-5660 debugging. It was exposed as an edge case of an edge
case (using EACH_QUORUM, in EC2, during a major rev upgrade); however, it probably applies
to many upgrade scenarios.
                
      was (Author: jasobrown):
    I suspect it's just something we haven't noticed before. I found it when upgrading from
1.0 to 1.1, largely via the CASSANDRA-5660 debugging. It was exposed as an edge case of an
edge case (using EACH_QUORUM, in EC2, during a major rev upgrade); however, it probably applies
to every upgrade scenario.
                  
> Gossiper.handleMajorStateChange can lose existing node ApplicationState
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-5665
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5665
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.5
>            Reporter: Jason Brown
>            Priority: Minor
>              Labels: gossip, upgrade
>             Fix For: 1.2.6, 2.0 beta 1
>
>         Attachments: 5665-v1.diff
>
>
> Dovetailing on CASSANDRA-5660, I discovered that further along during an upgrade, when
more nodes are on the new major version, a node on the previous version can get passed some incomplete
Gossip info about another, already upgraded node, and the older node drops AppState info about
that node.
> I think what happens is that a 1.1 node (older rev) gets gossip info from a 1.2 node
(A), which includes incomplete (lacking some AppState data) gossip info about another 1.2
node (B). The 1.1 node, which has incorrectly kicked node B out of gossip due to the
bug described in #5660, then takes that incomplete node B info and wholesale replaces any
previously known state about node B in Gossiper.handleMajorStateChange. Thus, if we previously
had DC/RACK info, it'll get dropped as part of the endpointStateMap.put(endpointstate). When
the data being passed is incomplete, the 1.1 node will start referencing node B and get into the NPE
situation in #5498.
> Anecdotally, this bad state is short-lived, less than a few minutes, even as short as
ten seconds, until gossip catches up and properly propagates the AppState data. Furthermore,
when upgrading a two-datacenter, 48-node cluster, it only occurred on two nodes for less than
a minute each. Thus, the scope seems limited, but the problem can occur.
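
To illustrate the failure mode described above, here is a minimal sketch in Java. It is a simplified, hypothetical model: the class GossipStateSketch and its methods replaceState, mergeState, and get are made up for illustration and are not Cassandra's Gossiper API, and the merge variant is only one possible remedy, not necessarily what the attached 5665-v1.diff does. It contrasts replacing a node's known state wholesale with merging the incoming state into what is already known.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of per-endpoint gossip state; not Cassandra's classes. */
class GossipStateSketch {
    // endpoint name -> (application state key -> value), e.g. "DC" -> "us-east"
    private final Map<String, Map<String, String>> endpointStateMap = new HashMap<>();

    /** Wholesale replacement: any key missing from the incoming state is lost. */
    void replaceState(String endpoint, Map<String, String> incoming) {
        endpointStateMap.put(endpoint, new HashMap<>(incoming));
    }

    /** Merge: incoming keys overwrite, but keys absent from the incoming state survive. */
    void mergeState(String endpoint, Map<String, String> incoming) {
        endpointStateMap.computeIfAbsent(endpoint, k -> new HashMap<>()).putAll(incoming);
    }

    /** Returns the stored value, or null if the endpoint or key is unknown. */
    String get(String endpoint, String key) {
        Map<String, String> state = endpointStateMap.get(endpoint);
        return state == null ? null : state.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> full = new HashMap<>();
        full.put("STATUS", "NORMAL");
        full.put("DC", "us-east");
        full.put("RACK", "1a");

        // Incomplete gossip about node B: DC and RACK are missing.
        Map<String, String> incomplete = new HashMap<>();
        incomplete.put("STATUS", "NORMAL");

        GossipStateSketch replacing = new GossipStateSketch();
        replacing.replaceState("B", full);
        replacing.replaceState("B", incomplete);
        System.out.println("replace: DC=" + replacing.get("B", "DC"));

        GossipStateSketch merging = new GossipStateSketch();
        merging.mergeState("B", full);
        merging.mergeState("B", incomplete);
        System.out.println("merge:   DC=" + merging.get("B", "DC"));
    }
}
```

In this model the replace path leaves node B without DC/RACK values, which is the kind of missing data that a later lookup could trip over, while the merge path keeps the previously known values until fresher ones arrive.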

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
