cassandra-user mailing list archives

From Mike Torra <>
Subject lots of connection timeouts around same time every day
Date Thu, 16 Feb 2017 15:52:36 GMT
Hi there -

Cluster info:
C* 3.9, replicated across 4 EC2 regions (us-east-1, us-west-2, eu-west-1,
ap-southeast-1), c4.4xlarge

Around the same time every day (~7-8am EST), two DCs (eu-west-1 and
ap-southeast-1) in our cluster start experiencing a high number of timeouts
(Connection.TotalTimeouts metric). The issue seems to occur equally on all
nodes in the impacted DCs. I'm trying to track down exactly what is timing
out, and what is causing it to happen.

With debug logs, I can see many messages like this:

DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 -
Convicting /xx.xx.xx.xx with status NORMAL - alive false

DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 -
Convicting /xx.xx.xx.xx with status removed - alive false

DEBUG [GossipTasks:1] 2017-02-16 15:39:42,274 -
Convicting /xx.xx.xx.xx with status shutdown - alive false
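
To see how often each endpoint is being convicted, and with which status,
the debug log can be tallied with a short script. This is only a sketch: it
assumes each message sits on a single line in the actual log file (the lines
above are wrapped by the mail client), and the function name
tally_convictions is made up for illustration.

```python
import re
from collections import Counter

# Matches the 'Convicting' message text shown above; anything before it
# on the line (level, thread, timestamp) is ignored.
CONVICT_RE = re.compile(r"Convicting (\S+) with status (\S+) - alive (\w+)")


def tally_convictions(lines):
    """Count 'Convicting' messages per (endpoint, status) pair."""
    counts = Counter()
    for line in lines:
        match = CONVICT_RE.search(line)
        if match:
            endpoint, status, _alive = match.groups()
            counts[(endpoint, status)] += 1
    return counts


# Usage (the log path is an assumption and depends on the install):
#   with open("/var/log/cassandra/debug.log") as fh:
#       for key, n in tally_convictions(fh).most_common():
#           print(key, n)
```

Comparing the tallies from before and during the daily window should show
whether the convictions are concentrated on a few endpoints.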

I removed the 'status removed' node from the cluster with `nodetool
removenode`, so I'm not sure why it still appears. The node mentioned in the
'status NORMAL' line has constant warnings like this:

WARN  [GossipTasks:1] 2017-02-16 15:40:02,845 - Gossip
stage has 453589 pending tasks; skipping status check (no nodes will be
marked down)

These warnings seem to go away after restarting that node, and the
'Convicting' lines on the original node go away as well. However, the
timeout counts do not seem to change. Why does restarting the node seem to
fix gossip falling behind?
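
To check whether the gossip backlog grows steadily or only around the daily
window, the pending-task counts can be pulled out of the warnings together
with their timestamps. Again a sketch under assumptions: one message per
line in the real log, the timestamp format shown above, and a hypothetical
function name pending_gossip_tasks.

```python
import re

# Captures the timestamp (to the second) and the pending-task count from
# the 'Gossip stage has N pending tasks' warning.
PENDING_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}"
    r".*Gossip stage has (\d+) pending tasks"
)


def pending_gossip_tasks(lines):
    """Return a list of (timestamp, pending_count) in log order."""
    samples = []
    for line in lines:
        match = PENDING_RE.search(line)
        if match:
            samples.append((match.group(1), int(match.group(2))))
    return samples
```

The GossipStage row of `nodetool tpstats` reports a pending count as well,
which gives a live cross-check against what the log shows.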

There are also a lot of debug log messages like this:

DEBUG [GossipStage:1] 2017-02-16 15:45:04,849 -
Ignoring interval time of 2355580769 for /xx.xx.xx.xx

Could these be related to the high number of timeouts I see? I've also
tried increasing the value of phi_convict_threshold to 12, as suggested.
This does not seem to have changed anything on the nodes where I've changed
it.
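
For scale, the interval in the 'Ignoring interval time' message can be
converted to seconds. This sketch rests on two assumptions that are not
confirmed anywhere in this message: that the logged value is in nanoseconds,
and that the failure detector discards intervals longer than roughly 2
seconds. The names below are illustrative only.

```python
NANOS_PER_SECOND = 1_000_000_000

# Assumption: intervals above about 2 seconds are ignored by the failure
# detector. Adjust this if the cutoff on your version differs.
ASSUMED_MAX_INTERVAL_NS = 2 * NANOS_PER_SECOND


def describe_interval(interval_ns, max_interval_ns=ASSUMED_MAX_INTERVAL_NS):
    """Return (interval in seconds, whether it exceeds the assumed cutoff)."""
    seconds = interval_ns / NANOS_PER_SECOND
    return seconds, interval_ns > max_interval_ns


seconds, over_cutoff = describe_interval(2355580769)
print(f"{seconds:.2f}s, over assumed cutoff: {over_cutoff}")
```

Under those assumptions the logged value is about 2.36 seconds between
gossip heartbeats from that endpoint, which would point to delayed gossip
between the nodes rather than a separate problem.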

I appreciate any suggestions on what else to try in order to track down
these timeouts.

- Mike
