cassandra-commits mailing list archives

From "Jason Brown (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-5665) Gossiper.handleMajorStateChange can lose existing node ApplicationState
Date Wed, 19 Jun 2013 18:00:33 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Brown updated CASSANDRA-5665:
-----------------------------------

    Attachment: 5665-v1.diff

The attached patch modifies Gossiper.handleMajorStateChange to check whether the endpoint
already exists in the endpointStateMap. If it does, the patch adds a previous ApplicationState
field to the new epState if a) that AppState does not exist in the new epState struct, or
b) the previous AppState has a version greater than the one in the new epState.
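
As an illustration of the merge rule described above (this is not the contents of 5665-v1.diff), here is a minimal, self-contained Java sketch. The types AppState and Versioned and the method mergePrevious are hypothetical stand-ins, not Cassandra's actual classes, and the sketch assumes each state is simply a value paired with an integer version.

```java
import java.util.EnumMap;
import java.util.Map;

public class GossipMergeSketch {
    // Hypothetical stand-in for Cassandra's ApplicationState keys.
    enum AppState { STATUS, DC, RACK, RELEASE_VERSION }

    // Hypothetical stand-in for a versioned gossip value.
    static final class Versioned {
        final String value;
        final int version;

        Versioned(String value, int version) {
            this.value = value;
            this.version = version;
        }
    }

    /**
     * Returns a copy of newState that also contains every entry of
     * previousState that is (a) missing from newState or (b) carries a
     * greater version than the corresponding entry in newState.
     */
    static Map<AppState, Versioned> mergePrevious(Map<AppState, Versioned> previousState,
                                                   Map<AppState, Versioned> newState) {
        Map<AppState, Versioned> merged = new EnumMap<>(AppState.class);
        merged.putAll(newState);
        if (previousState == null) {
            return merged; // endpoint was not known before; nothing to retain
        }
        for (Map.Entry<AppState, Versioned> previous : previousState.entrySet()) {
            Versioned incoming = merged.get(previous.getKey());
            if (incoming == null || previous.getValue().version > incoming.version) {
                merged.put(previous.getKey(), previous.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<AppState, Versioned> previous = new EnumMap<>(AppState.class);
        previous.put(AppState.STATUS, new Versioned("NORMAL", 3));
        previous.put(AppState.DC, new Versioned("dc1", 1));
        previous.put(AppState.RACK, new Versioned("rack1", 2));

        // Incomplete incoming state: DC and RACK are missing.
        Map<AppState, Versioned> incoming = new EnumMap<>(AppState.class);
        incoming.put(AppState.STATUS, new Versioned("NORMAL", 7));

        Map<AppState, Versioned> merged = mergePrevious(previous, incoming);
        for (Map.Entry<AppState, Versioned> e : merged.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue().value
                               + " (version " + e.getValue().version + ")");
        }
    }
}
```

In this sketch the newer STATUS from the incoming state wins, while DC and RACK survive from the previous state. Whether retaining old entries like this is always safe is exactly the open question raised in this comment.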

On the surface the patch is straightforward, but I'm not sure whether some subtle bugs
might creep in from retaining previous state (although that state might get replaced
anyway in a very short time). Thus, while the patch fixes 'my problem', I'm not sure this
is the safest way to resolve it.

                
> Gossiper.handleMajorStateChange can lose existing node ApplicationState
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-5665
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5665
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.5
>            Reporter: Jason Brown
>            Priority: Minor
>              Labels: gossip, upgrade
>             Fix For: 1.2.6, 2.0 beta 1
>
>         Attachments: 5665-v1.diff
>
>
> Dovetailing on #5660, I discovered that further along during an upgrade, when more nodes
> are on the new major version, a node on the previous version can get passed some incomplete
> Gossip info about another, already upgraded node, and the older node drops AppState info
> about that node.
> I think what happens is that a 1.1 node (older rev) gets gossip info from a 1.2 node
> (A), which includes incomplete (lacking some AppState data) gossip info about another 1.2
> node (B). The 1.1 node, which has incorrectly kicked node B out of gossip due to the
> bug described in #5660, then takes that incomplete node B info and wholesale replaces any
> previously known state about node B in Gossiper.handleMajorStateChange. Thus, if we previously
> had DC/RACK info, it'll get dropped as part of the endpointStateMap.put(endpointstate). When
> the data being passed is incomplete, 1.1 will start referencing node B and get into the NPE
> situation in #5498.
> Anecdotally, this bad state is short-lived: less than a few minutes, even as short as
> ten seconds, until gossip catches up and properly propagates the AppState data. Furthermore,
> when upgrading a two-datacenter, 48-node cluster, it occurred on only two nodes, for less than
> a minute each. Thus, the scope seems limited, but the problem can occur.
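
To make the failure mode in the description concrete, here is a small Java sketch of how a wholesale put of incomplete state drops previously known DC/RACK values. It is a simplified model under assumptions: the map of plain strings, the endpoint name "nodeB", and the helper methods replaceWholesale and datacenterOf are all hypothetical and do not mirror the real Gossiper types.

```java
import java.util.HashMap;
import java.util.Map;

public class GossipOverwriteSketch {
    // Hypothetical, simplified stand-in for the endpointStateMap:
    // endpoint name -> (application state name -> value).
    static final Map<String, Map<String, String>> endpointStateMap = new HashMap<>();

    // Mimics a wholesale replacement: anything known before is discarded.
    static void replaceWholesale(String endpoint, Map<String, String> incoming) {
        endpointStateMap.put(endpoint, incoming);
    }

    // Returns the datacenter recorded for the endpoint, or null if unknown.
    static String datacenterOf(String endpoint) {
        Map<String, String> state = endpointStateMap.get(endpoint);
        return state == null ? null : state.get("DC");
    }

    public static void main(String[] args) {
        // Complete state learned earlier about node B.
        Map<String, String> complete = new HashMap<>();
        complete.put("STATUS", "NORMAL");
        complete.put("DC", "dc1");
        complete.put("RACK", "rack1");
        replaceWholesale("nodeB", complete);

        // Incomplete state relayed later: DC and RACK are missing.
        Map<String, String> incomplete = new HashMap<>();
        incomplete.put("STATUS", "NORMAL");
        replaceWholesale("nodeB", incomplete);

        // Any caller that dereferences this value without a null check
        // would now hit a NullPointerException.
        String dc = datacenterOf("nodeB");
        System.out.println("DC still known after overwrite: " + (dc != null));
    }
}
```

After the second call the DC lookup returns null, which corresponds to the window in which the older node can run into the NPE until gossip propagates the full state again.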

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
