cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9765) checkForEndpointCollision fails for legitimate collisions
Date Wed, 15 Jul 2015 08:56:04 GMT


Stefania commented on CASSANDRA-9765:

Thank you for your in-dept analysis Jason. If I understood correctly, this method should be
called in {{checkForEndpointCollision}} as follows and replace all other conditions:

if (!Gossiper.instance.isSafeForBootstrap(FBUtilities.getBroadcastAddress()))

I think it makes sense to look at the gossip states rather than {{TMD.isMember()}}. I am still
unclear however on the empty, BOOTSTRAP and SHUTDOWN states. In your method the first is not
allowed and the last two are both allowed.
According to CASSANDRA-7939 BOOTSTRAP should be allowed but not SHUTDOWN? 

If you clarify which states are OK to join the cluster I can rewrite the method, here is what
I have so far:

public boolean isSafeForBootstrap(InetAddress endpoint)
        EndpointState epState = endpointStateMap.get(endpoint);
        // if there's no previous state, or the node was previously removed from the cluster,
we're good
        if (epState == null || isDeadState(epState))
            return true;

       String state = getApplicationState(epState);

        // these states are not allowed to join the cluster
        List<String> states = new ArrayList<String>() {{
            add(""); // failed bootstrap but we did start gossiping <== is this allowed
or NOT? (CASSANDRA-7939)
            add(VersionedValue.STATUS_NORMAL); // node is legit in the cluster or it was stopped
with kill -9 <== definitely not allowed
            add(VersionedValue.STATUS_BOOTSTRAPPING); // failed bootstrap (e.g. kill -9 during
bootstrap) <== is this allowed or NOT? (CASSANDRA-7939)
            add(VersionedValue.SHUTDOWN); }}; // node was shutdown but not removed <==
not allowed also, right?
        return !states.contains(state);

As for the tests, I was able to add 3 more tests so right now we have the following states:

* SHUTDOWN - node is shutdown gracefully, test name : _shutdown_wiped_node_cannot_join_test_
* NORMAL - node is shutdown with kill -9, test name : _killed_wiped_node_cannot_join_test_
* LEFT - node is decommissioned, test name : _decommissioned_wiped_node_can_join_test_
* BOOT - node is shutdown with kill -9 during bootstrap, test name : _failed_bootstap_wiped_node_can_join_test_

I think these are the essential tests, the only one missing is perhaps killing the node after
it starts gossiping and before it enters the bootstrap, but I am not sure how easy it is as
the python code must reliably kill the node at the right time (with the bootstrap I was able
to extend it by increasing the amount of data streamed). 

> checkForEndpointCollision fails for legitimate collisions
> ---------------------------------------------------------
>                 Key: CASSANDRA-9765
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Richard Low
>            Assignee: Stefania
>             Fix For: 2.0.17
> Since CASSANDRA-7939, checkForEndpointCollision no longer catches a legitimate collision.
Without CASSANDRA-7939, wiping a node and starting it again fails with 'A node with address
%s already exists', but with it the node happily enters joining state, potentially streaming
from the wrong place and violating consistency.

This message was sent by Atlassian JIRA

View raw message