cassandra-commits mailing list archives

From "Olivier Michallat (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10052) Bringing one node down, makes the whole cluster go down for a second
Date Thu, 13 Aug 2015 10:12:46 GMT


Olivier Michallat commented on CASSANDRA-10052:

Yes, I agree with that fix.
It's safe to not send the notification, because the only client that would be interested in
it has lost its control connection anyway. It will find out by itself when the connections
get closed.
We should even extend that to other notifications, otherwise the client will get "fake" ADD,
UP or REMOVED events.
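
A rough sketch of that check in Java, assuming the fix amounts to skipping any event about a
remote endpoint whose rpc address equals the local node's own rpc address. The names here
(NotificationFilterSketch, shouldNotify, ip) are made up for illustration and are not
Cassandra APIs. The check does not look at the event type, so it would apply equally to
ADD, UP, DOWN or REMOVED.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Cassandra: decides whether an event about
// `endpoint` is worth pushing to the clients connected to the local node.
class NotificationFilterSketch {

    static boolean shouldNotify(InetAddress endpoint, InetAddress endpointRpc,
                                InetAddress local, InetAddress localRpc) {
        boolean isRemote = !endpoint.equals(local);
        boolean sameRpcAsLocal = endpointRpc.equals(localRpc);
        // A remote node advertising our own rpc address would look to clients
        // like an event about the node they are connected to: drop it.
        return !(isRemote && sameRpcAsLocal);
    }

    // Builds an address from four octets without any DNS lookup.
    static InetAddress ip(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e); // only thrown for a wrong array length
        }
    }

    public static void main(String[] args) {
        InetAddress local = ip(10, 0, 0, 1);
        InetAddress remote = ip(10, 0, 0, 2);
        InetAddress loopback = ip(127, 0, 0, 1);
        // Every node configured with rpc_address: localhost
        System.out.println(shouldNotify(remote, loopback, local, loopback)); // prints false
        // Each node configured with a distinct rpc_address
        System.out.println(shouldNotify(remote, remote, local, local));      // prints true
    }
}
```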

That's an interesting setup because drivers use rpc_address to uniquely identify nodes, but
here all nodes use the same address so each client thinks there is only one node.
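
To illustrate why the client ends up with a single node, here is a minimal Java sketch of a
registry keyed by rpc address. It is not the driver's actual code: HostRegistrySketch,
register and the port 9042 are assumptions chosen for the example.

```java
import java.net.InetSocketAddress;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side registry: one entry per advertised rpc address.
class HostRegistrySketch {

    // Each row is {node name, rpc_address}; the port is an assumed value.
    static Map<InetSocketAddress, String> register(String[][] nodes) {
        Map<InetSocketAddress, String> hosts = new LinkedHashMap<>();
        for (String[] node : nodes) {
            // createUnresolved avoids a DNS lookup; equality is by host name and port
            InetSocketAddress key = InetSocketAddress.createUnresolved(node[1], 9042);
            hosts.put(key, node[0]); // a node with an already-seen address replaces the entry
        }
        return hosts;
    }

    public static void main(String[] args) {
        String[][] cluster = {
            {"node1", "localhost"}, {"node2", "localhost"}, {"node3", "localhost"}
        };
        System.out.println(register(cluster).size()); // prints 1
    }
}
```

With three nodes all advertising localhost, the map holds one entry, so a DOWN event for
that address looks to the client like the loss of its only host.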

> Bringing one node down, makes the whole cluster go down for a second
> --------------------------------------------------------------------
>                 Key: CASSANDRA-10052
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sharvanath Pathak
>            Assignee: Stefania
>            Priority: Critical
> When a node goes down, the other nodes learn that through the gossip.
> And I do see the log from (
> {code}
> private void markDead(InetAddress addr, EndpointState localState)
>    {
>        if (logger.isTraceEnabled())
>            logger.trace("marking as down {}", addr);
>        localState.markDead();
>        liveEndpoints.remove(addr);
>        unreachableEndpoints.put(addr, System.nanoTime());
>        logger.info("InetAddress {} is now DOWN", addr);
>        for (IEndpointStateChangeSubscriber subscriber : subscribers)
>            subscriber.onDead(addr, localState);
>        if (logger.isTraceEnabled())
>            logger.trace("Notified " + subscribers);
>    }
> {code}
> This shows up as "InetAddress is now DOWN" in Cassandra's system log.
> Now, on all the other nodes, the client side (Java driver) says "Cannot connect to any
> host, scheduling retry in 1000 milliseconds". The clients eventually reconnect, but some
> queries fail during this intermediate period.
> It seems to me that when the server pushes the nodeDown event, it calls getRpcAddress(endpoint)
> and thus sends localhost as the argument in the nodeDown event.
> As in
> {code}
>   public void onDown(InetAddress endpoint)
>        {      
>            server.connectionTracker.send(Event.StatusChange.nodeDown(getRpcAddress(endpoint),
>        }
> {code}
> getRpcAddress returns localhost for any endpoint if cassandra.yaml uses localhost
> as the value of rpc_address (which, by the way, is the default).
