cassandra-user mailing list archives

From Jeff Jirsa <jeff.ji...@crowdstrike.com>
Subject Re: Ring connection timeouts with 2.2.6
Date Tue, 05 Jul 2016 03:16:24 GMT
The AWS Ubuntu 14.04 AMI ships with a buggy enhanced networking driver. Depending on your instance
types / hypervisor choice, you may want to make sure you're not hitting that bug.
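
To see which driver and version an interface is using, `ethtool -i <interface>` prints one `key: value` pair per line. Below is a minimal Python sketch that parses that output. The interface name `eth0` and both function names are assumptions for illustration, and since this thread does not say which driver versions are affected, the sketch only reports what is installed (on SR-IOV instance types the driver is typically `ixgbevf`).

```python
import subprocess


def parse_ethtool_info(output):
    """Parse `ethtool -i` output ("key: value" per line) into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as PCI bus
        # addresses (which contain colons) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(interface="eth0"):
    """Run `ethtool -i` for the given interface and parse the result.

    The default interface name is an assumption; adjust it for your hosts.
    """
    result = subprocess.run(
        ["ethtool", "-i", interface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ethtool_info(result.stdout)
```

Calling `driver_info()` on each node and comparing the `driver` and `version` fields across the ring would show whether the nodes with timeouts share a driver build.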

 

From: Mike Heffner <mike@librato.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Friday, July 1, 2016 at 1:10 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: Peter Norton <pcn@librato.com>
Subject: Re: Ring connection timeouts with 2.2.6

 

Jens, 

 

We haven't noticed any particularly large GC operations or even persistently high GC times.

 

Mike

 

On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.rantil@tink.se> wrote:

Hi,

Could it be garbage collection occurring on nodes that are more heavily loaded?

Cheers,
Jens

 

On Sun, Jun 26, 2016 at 05:22, Mike Heffner <mike@librato.com> wrote:

One thing to add: if we do a rolling restart of the ring, the timeouts disappear entirely for
several hours and performance returns to normal. It's as if something is leaking over time,
but we haven't seen any noticeable change in heap usage.

 

On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <mike@librato.com> wrote:

Hi, 

 

We have a 12-node 2.2.6 ring running in AWS, single DC with RF=3, that is sitting at <25%
CPU, doing mostly writes, and not showing any particularly long GC times/pauses. By all observed
metrics the ring is healthy and performing well.

 

However, we are noticing a pretty consistent number of connection timeouts coming from the
messaging service between various pairs of nodes in the ring. The "Connection.TotalTimeouts"
meter metric shows 100k's of timeouts per minute, usually between two pairs of nodes. This
persists for several hours at a time, then may stop or move to other pairs of nodes in the
ring. The metric "Connection.SmallMessageDroppedTasks.<ip>" will also grow for one pair of
the nodes in the TotalTimeouts metric.
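
One way to pin down which pairs are affected, and when, is to sample the cumulative count of each timeout meter at intervals and turn the deltas into per-minute rates. Below is a minimal Python sketch, assuming the samples have already been collected (for example over JMX) as `(unix_seconds, count)` tuples. The function name and the sample values are hypothetical and not taken from this ring.

```python
def per_minute_rates(samples):
    """Convert cumulative (timestamp_seconds, count) samples into rates.

    Returns a list of (timestamp_seconds, rate_per_minute), one entry per
    interval, stamped with the interval's end time. An interval where the
    count goes backwards is treated as a counter reset (for example a node
    restart) and is skipped.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            continue  # out-of-order sample or counter reset
        rates.append((t1, (c1 - c0) * 60.0 / elapsed))
    return rates


# Hypothetical samples taken 30 s apart; the last one follows a restart.
samples = [(0, 1_000), (30, 51_000), (60, 101_000), (90, 500)]
rates = per_minute_rates(samples)
```

The reset handling matters here because the counts start over when a node restarts, so a rolling restart would otherwise show up as a large negative rate.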

 

Looking at the debug log typically shows a large number of messages like the following on
one of the nodes:

 

StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
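
To see whether these messages cluster on particular targets, the lines can be counted per address. Below is a minimal Python sketch that matches only the message text quoted above. The rest of the log-line layout (timestamp, level, thread) is not assumed, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the message text as quoted in this thread, capturing the target
# address and the ttl value.
SKIPPED_HINT = re.compile(r"Skipped writing hint for /(\S+) \(ttl (\d+)\)")


def count_skipped_hints(lines):
    """Return a Counter mapping target address -> number of skipped hints."""
    counts = Counter()
    for line in lines:
        match = SKIPPED_HINT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example input: the line from the debug log above, repeated, plus noise.
example = [
    "StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)",
    "some unrelated log line",
    "StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)",
]
counts = count_skipped_hints(example)
```

Passing an open file object for the debug log instead of the `example` list works the same way, since the function only iterates over lines.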

We have cross-node timeouts enabled, but NTP is running on all nodes and no node appears to
have time drift.
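
The reason clock drift matters with cross-node timeouts is that the receiver judges a message's age from the sender's timestamp, so any offset between the two clocks is added to (or subtracted from) the apparent time in flight. Below is a simplified Python model of that check. It is not Cassandra's code, and the function name, parameter names and numbers are hypothetical.

```python
def is_dropped(sent_at_ms, received_at_ms, timeout_ms, receiver_clock_offset_ms=0):
    """Simplified model of a cross-node timeout check.

    sent_at_ms is stamped by the sender's clock. The receiver compares it
    with its own clock, modelled here as the true receive time plus
    receiver_clock_offset_ms (positive means the receiver's clock is ahead).
    """
    receiver_now_ms = received_at_ms + receiver_clock_offset_ms
    apparent_age_ms = receiver_now_ms - sent_at_ms
    return apparent_age_ms > timeout_ms


# Hypothetical numbers: 500 ms actually in flight, 2000 ms timeout.
in_sync = is_dropped(sent_at_ms=0, received_at_ms=500, timeout_ms=2000)
skewed = is_dropped(
    sent_at_ms=0, received_at_ms=500, timeout_ms=2000, receiver_clock_offset_ms=1600
)
```

With synchronized clocks the message is well inside its budget. With the receiver 1600 ms ahead, the same message appears 2100 ms old and would be treated as timed out.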

 

The network appears to be fine between nodes, with iperf tests showing that we have a lot
of headroom.

 

Any thoughts on what to look for? Can we increase thread count/pool sizes for the messaging
service?

 

Thanks,

 

Mike

 

-- 


  Mike Heffner <mike@librato.com>

  Librato, Inc.

 



 


 

-- 

Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.



 


 

