cassandra-user mailing list archives

From: Mike Heffner <m...@librato.com>
Subject: Re: Ring connection timeouts with 2.2.6
Date: Fri, 01 Jul 2016 20:10:12 GMT
Jens,

We haven't noticed any particularly large GC operations or even persistently
high GC times.

Mike

On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.rantil@tink.se> wrote:

> Hi,
>
> Could it be garbage collection occurring on nodes that are more heavily
> loaded?
>
> Cheers,
> Jens
>
> On Sun, Jun 26, 2016 at 05:22, Mike Heffner <mike@librato.com> wrote:
>
>> One thing to add: if we do a rolling restart of the ring, the timeouts
>> disappear entirely for several hours and performance returns to normal.
>> It's as if something is leaking over time, but we haven't seen any
>> noticeable change in heap.
>>
>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <mike@librato.com> wrote:
>>
>>> Hi,
>>>
>>> We have a 12-node 2.2.6 ring running in AWS, single DC with RF=3, that
>>> is sitting at <25% CPU, doing mostly writes, and not showing any particularly
>>> long GC times/pauses. By all observed metrics the ring is healthy and
>>> performing well.
>>>
>>> However, we are noticing a pretty consistent number of connection
>>> timeouts coming from the messaging service between various pairs of nodes
>>> in the ring. The "Connection.TotalTimeouts" meter metric shows 100k's of
>>> timeouts per minute, usually between two pairs of nodes. This occurs for
>>> several hours at a time, then may stop or move to other pairs of nodes in
>>> the ring. The "Connection.SmallMessageDroppedTasks.<ip>" metric will also
>>> grow for one of the pair of nodes in the TotalTimeouts metric.
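>>>
>>> (For anyone who wants to poll these counters directly, here is a rough JMX
>>> sketch in Java. The MBean object names and attribute names are assumptions
>>> based on the usual org.apache.cassandra.metrics layout in 2.2, and the
>>> host/port are placeholders, so verify them with a JMX browser before
>>> relying on it.)
>>>
>>> import javax.management.MBeanServerConnection;
>>> import javax.management.ObjectName;
>>> import javax.management.remote.JMXConnector;
>>> import javax.management.remote.JMXConnectorFactory;
>>> import javax.management.remote.JMXServiceURL;
>>>
>>> public class TimeoutPoller {
>>>     public static void main(String[] args) throws Exception {
>>>         // Assumed host/port; 7199 is the default Cassandra JMX port.
>>>         String host = args.length > 0 ? args[0] : "localhost";
>>>         JMXServiceURL url = new JMXServiceURL(
>>>                 "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
>>>         try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
>>>             MBeanServerConnection mbsc = connector.getMBeanServerConnection();
>>>
>>>             // Assumed aggregate meter name; confirm it exists on your build.
>>>             ObjectName total = new ObjectName(
>>>                     "org.apache.cassandra.metrics:type=Connection,name=TotalTimeouts");
>>>             System.out.println("TotalTimeouts count: "
>>>                     + mbsc.getAttribute(total, "Count"));
>>>             System.out.println("TotalTimeouts 1m rate: "
>>>                     + mbsc.getAttribute(total, "OneMinuteRate"));
>>>
>>>             // Per-peer timeout meters are scoped by the remote IP, which
>>>             // should show which node pairs are affected.
>>>             ObjectName perPeer = new ObjectName(
>>>                     "org.apache.cassandra.metrics:type=Connection,scope=*,name=Timeouts");
>>>             for (ObjectName name : mbsc.queryNames(perPeer, null)) {
>>>                 System.out.println(name.getKeyProperty("scope") + " timeouts: "
>>>                         + mbsc.getAttribute(name, "Count"));
>>>             }
>>>         }
>>>     }
>>> }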
>>>
>>> Looking at the debug log typically shows a large number of messages like
>>> the following on one of the nodes:
>>>
>>> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
>>>
>>> We have cross-node timeouts enabled, but NTP is running on all nodes and
>>> no node appears to have time drift.
>>>
>>> The network appears to be fine between nodes, with iperf tests showing
>>> that we have a lot of headroom.
>>>
>>> Any thoughts on what to look for? Can we increase thread count/pool
>>> sizes for the messaging service?
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> --
>>>
>>>   Mike Heffner <mike@librato.com>
>>>   Librato, Inc.
>>>
>>>
>>
>>
>> --
>>
>>   Mike Heffner <mike@librato.com>
>>   Librato, Inc.
>>
>> --
>
> Jens Rantil
> Backend Developer @ Tink
>
> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
> For urgent matters you can reach me at +46-708-84 18 32.
>



-- 

  Mike Heffner <mike@librato.com>
  Librato, Inc.
