cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (CASSANDRA-9218) Node thinks other nodes are down after heavy GC
Date Tue, 21 Apr 2015 13:06:59 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-9218.
---------------------------------------
    Resolution: Duplicate

> Node thinks other nodes are down after heavy GC
> -----------------------------------------------
>
>                 Key: CASSANDRA-9218
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9218
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Erik Forsberg
>
> I have a few troublesome nodes which often end up doing very long GC pauses. The root
cause of this is yet to be found, but it's causing another problem - the affected node(s)
mark other nodes as down, and they never recover.
> Here's how it goes:
> 1. Node goes into troublesome mode, doing heavy GC with long (10+ seconds) GC pauses.
> 2. While this happens, node will mark other nodes as down.
> 3. Once the overload situation resolves, the node still thinks the other nodes are down
(they are not). It's also quite common that other nodes think the affected node is down.
> So we often end up with node A thinking some 30 nodes are down, and a bunch of other
nodes believing node A is down. This is in a cluster with 56 nodes.
> The only way to get out of the situation is to restart node A, and sometimes a few other
nodes. And while node A is in this state, any queries that use node A as coordinator have
a high risk of getting errors about not enough replicas being available. 
> I have enabled TRACE-level gossip debugging while this happens, and on node A there
are multiple messages saying "has already a pending echo, skipping it", i.e. the debug
line in Gossiper.java line 882.
> I have also observed, while this was happening, that other nodes were trying to establish
connections (SYN packets sent) but the troubled node (A) was not accepting them (no accept()).
> I don't know exactly how Gossiper works here, but it looks like node A sends out some
gossip echo messages, is then too busy to receive the replies, and never retries.
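
The reporter's hypothesis above can be illustrated with a toy model. This is only a sketch of the suspected failure mode, not Cassandra's actual Gossiper code: the class PendingEchoSketch and its methods tryEcho, onEchoReply and isLive are hypothetical names invented for illustration. It assumes that a node records a "pending echo" per peer when it sends an echo and clears that record only when the reply arrives, with no timeout or retry.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the hypothesised behaviour; NOT Cassandra's Gossiper.
// All class and method names are hypothetical.
public class PendingEchoSketch {
    // Peers for which an echo was sent and no reply has been processed yet.
    private final Set<String> pendingEcho = new HashSet<>();
    // Peers this node currently considers up.
    private final Set<String> live = new HashSet<>();

    /** Returns true if an echo is sent, false if it is skipped because one is pending. */
    public boolean tryEcho(String peer) {
        // Set.add returns false when the peer is already present,
        // which models "has already a pending echo, skipping it".
        return pendingEcho.add(peer);
    }

    /** The only place the pending marker is cleared in this model. */
    public void onEchoReply(String peer) {
        if (pendingEcho.remove(peer)) {
            live.add(peer);
        }
    }

    public boolean isLive(String peer) {
        return live.contains(peer);
    }

    public static void main(String[] args) {
        PendingEchoSketch nodeA = new PendingEchoSketch();
        System.out.println(nodeA.tryEcho("B")); // true: echo sent
        // Assume the reply is lost while node A is stuck in a long GC pause,
        // so onEchoReply("B") is never called.
        System.out.println(nodeA.tryEcho("B")); // false: skipped, never retried
        System.out.println(nodeA.isLive("B"));  // false: B stays marked down
    }
}
```

In this model the peer stays marked down until the process restarts, which would match the observation that only restarting node A clears the state. Whether the real code behaves this way is not established by the report.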



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
