cassandra-commits mailing list archives

From "Sharvanath Pathak (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10052) Bringing one node down, makes the whole cluster go down for a second
Date Wed, 12 Aug 2015 17:37:45 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693892#comment-14693892 ]

Sharvanath Pathak edited comment on CASSANDRA-10052 at 8/12/15 5:36 PM:
------------------------------------------------------------------------

The clients are connected only to localhost, but this event arrives even when some other
node fails. I think not sending these uninteresting events could be the right fix, but right
now it seems that it is up to the client to ignore them. Moreover, in this case the argument
of the event is localhost, so the client can't do anything.

Disclaimer: I'm using 2.0.9, but on a quick inspection it seems that the code for this part
has not changed since.
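
A minimal sketch of the "do not send uninteresting events" idea, under stated assumptions: `NodeDownFilter`, `shouldPushNodeDown`, and `ip` are hypothetical names for illustration and are not part of Cassandra. The assumption is that the server knows both the gossip endpoint that failed and the RPC address it would put in the event, and can skip the push when a remote failure would be reported as localhost.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class NodeDownFilter {
    // Hypothetical helper (not Cassandra code): decide whether a nodeDown
    // event is worth pushing to connected clients.
    static boolean shouldPushNodeDown(InetAddress gossipEndpoint, InetAddress advertisedRpcAddress) {
        // A loopback RPC address for a non-loopback endpoint would tell the
        // client that "localhost" is down, although a remote node failed.
        boolean misleading = advertisedRpcAddress.isLoopbackAddress()
                && !gossipEndpoint.isLoopbackAddress();
        return !misleading;
    }

    // Builds an IPv4 address from literal octets without a DNS lookup.
    static InetAddress ip(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress remote = ip(192, 168, 101, 1);
        InetAddress loopback = InetAddress.getLoopbackAddress();
        System.out.println(shouldPushNodeDown(remote, loopback)); // false: event would name localhost
        System.out.println(shouldPushNodeDown(remote, remote));   // true: event names the failed node
    }
}
```

Whether suppressing the event or sending the real endpoint address is the better fix is left open here; the sketch only shows the check itself.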


was (Author: sharvanath):
The clients are connected only to localhost, but this event arrives even when some other
node fails. I think not sending these uninteresting events could be the right fix, but right
now I think it is up to the client to ignore them. Moreover, in this case the argument of
the event is localhost, so the client can't do anything.

Disclaimer: I'm using 2.0.9, but on a quick inspection it seems that the code for this part
has not changed since.

> Bringing one node down, makes the whole cluster go down for a second
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10052
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10052
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sharvanath Pathak
>            Priority: Critical
>
> When a node goes down, the other nodes learn about it through gossip,
> and I do see the log message from markDead (Gossiper.java):
> {code}
>    private void markDead(InetAddress addr, EndpointState localState)
>    {
>        if (logger.isTraceEnabled())
>            logger.trace("marking as down {}", addr);
>        localState.markDead();
>        liveEndpoints.remove(addr);
>        unreachableEndpoints.put(addr, System.nanoTime());
>        logger.info("InetAddress {} is now DOWN", addr);
>        for (IEndpointStateChangeSubscriber subscriber : subscribers)
>            subscriber.onDead(addr, localState);
>        if (logger.isTraceEnabled())
>            logger.trace("Notified " + subscribers);
>    }
> {code}
> This logs "InetAddress 192.168.101.1 is now DOWN" in Cassandra's system log.
> Now, on all the other nodes, the client side (Java driver) reports "Cannot connect to any host, scheduling retry in 1000 milliseconds". The clients eventually reconnect, but some queries fail during this intermediate period.
> It seems to me that when the server pushes the nodeDown event, it calls getRpcAddress(endpoint) and thus sends localhost as the argument in the nodeDown event.
> As in org.apache.cassandra.transport.Server.java
> {code}
>        public void onDown(InetAddress endpoint)
>        {
>            server.connectionTracker.send(Event.StatusChange.nodeDown(getRpcAddress(endpoint), server.socket.getPort()));
>        }
> {code}
> getRpcAddress returns localhost for any endpoint if cassandra.yaml uses localhost for rpc_address (which, by the way, is the default).
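
To illustrate the behavior described in the report, here is a small self-contained model, not the actual Cassandra implementation. `RpcAddressCollapse`, `rpcAddressFor`, and the map of advertised addresses are hypothetical, and the fallback to the gossip endpoint is an assumption of the sketch. It shows that when every node advertises `localhost` as its rpc_address, distinct endpoints resolve to the same value, so a nodeDown event carrying that value cannot identify the failed node.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RpcAddressCollapse {
    // Hypothetical model: maps a gossip endpoint to the rpc_address that node
    // advertises. Falls back to the endpoint itself when nothing is advertised
    // (an assumption of this sketch, not confirmed Cassandra behavior).
    static String rpcAddressFor(Map<String, String> advertised, String endpoint) {
        return advertised.getOrDefault(endpoint, endpoint);
    }

    public static void main(String[] args) {
        Map<String, String> advertised = new LinkedHashMap<>();
        // Both nodes run with rpc_address set to localhost.
        advertised.put("192.168.101.1", "localhost");
        advertised.put("192.168.101.2", "localhost");

        // Two different endpoints resolve to the same address.
        System.out.println(rpcAddressFor(advertised, "192.168.101.1")); // localhost
        System.out.println(rpcAddressFor(advertised, "192.168.101.2")); // localhost
    }
}
```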



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
