cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Heffner <m...@librato.com>
Subject Re: Ring connection timeouts with 2.2.6
Date Sat, 23 Jul 2016 14:18:05 GMT
Garo,

No, we didn't notice any change in system load, just the expected spike in
packet counts.

Mike

On Wed, Jul 20, 2016 at 3:49 PM, Juho Mäkinen <juho.makinen@gmail.com>
wrote:

> Just to pick this up: Did you see any system load spikes? I'm tracing a
> problem on 2.2.7 where my cluster sees load spikes up to 20-30, when the
> normal average load is around 3-4. So far I haven't found any good reason,
> but I'm going to try otc_coalescing_strategy: disabled tomorrow.
>
>  - Garo
>
> On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner <mike@librato.com> wrote:
>
>> Just to followup on this post with a couple of more data points:
>>
>> 1)
>>
>> We upgraded to 2.2.7 and did not see any change in behavior.
>>
>> 2)
>>
>> However, what *has* fixed this issue for us was disabling msg coalescing
>> by setting:
>>
>> otc_coalescing_strategy: DISABLED
>>
>> We were using the default setting before (time horizon I believe).
>>
>> We see periodic timeouts on the ring (once every few hours), but they are
>> brief and don't impact latency. With msg coalescing turned on we would see
>> these timeouts persist consistently after an initial spike. My guess is
>> that something in the coalescing logic is disturbed by the initial timeout
>> spike which leads to dropping all / high-percentage of all subsequent
>> traffic.
>>
>> We are planning to continue production use with msg coaleasing disabled
>> for now and may run tests in our staging environments to identify where the
>> coalescing is breaking this.
>>
>> Mike
>>
>> On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <mike@librato.com> wrote:
>>
>>> Jeff,
>>>
>>> Thanks, yeah we updated to the 2.16.4 driver version from source. I
>>> don't believe we've hit the bugs mentioned in earlier driver versions.
>>>
>>> Mike
>>>
>>> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.jirsa@crowdstrike.com>
>>> wrote:
>>>
>>>> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver –
>>>> depending on your instance types / hypervisor choice, you may want to
>>>> ensure you’re not seeing that bug.
>>>>
>>>>
>>>>
>>>> *From: *Mike Heffner <mike@librato.com>
>>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> *Date: *Friday, July 1, 2016 at 1:10 PM
>>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> *Cc: *Peter Norton <pcn@librato.com>
>>>> *Subject: *Re: Ring connection timeouts with 2.2.6
>>>>
>>>>
>>>>
>>>> Jens,
>>>>
>>>>
>>>>
>>>> We haven't noticed any particular large GC operations or even
>>>> persistently high GC times.
>>>>
>>>>
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.rantil@tink.se>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Could it be garbage collection occurring on nodes that are more heavily
>>>> loaded?
>>>>
>>>> Cheers,
>>>> Jens
>>>>
>>>>
>>>>
>>>> Den sön 26 juni 2016 05:22Mike Heffner <mike@librato.com> skrev:
>>>>
>>>> One thing to add, if we do a rolling restart of the ring the timeouts
>>>> disappear entirely for several hours and performance returns to normal.
>>>> It's as if something is leaking over time, but we haven't seen any
>>>> noticeable change in heap.
>>>>
>>>>
>>>>
>>>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <mike@librato.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that
>>>> is sitting at <25% CPU, doing mostly writes, and not showing any particular
>>>> long GC times/pauses. By all observed metrics the ring is healthy and
>>>> performing well.
>>>>
>>>>
>>>>
>>>> However, we are noticing a pretty consistent number of connection
>>>> timeouts coming from the messaging service between various pairs of nodes
>>>> in the ring. The "Connection.TotalTimeouts" meter metric show 100k's of
>>>> timeouts per minute, usually between two pairs of nodes for several hours
>>>> at a time. It seems to occur for several hours at a time, then may stop or
>>>> move to other pairs of nodes in the ring. The metric
>>>> "Connection.SmallMessageDroppedTasks.<ip>" will also grow for one pair
of
>>>> the nodes in the TotalTimeouts metric.
>>>>
>>>>
>>>>
>>>> Looking at the debug log typically shows a large number of messages
>>>> like the following on one of the nodes:
>>>>
>>>>
>>>>
>>>> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177
>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__172.26.33.177&d=CwMFaQ&c=08AGY6txKsvMOP6lYkHQpPMRA1U6kqhAwGa8-0QCg3M&r=yfYEBHVkX6l0zImlOIBID0gmhluYPD5Jje-3CtaT3ow&m=KlMh_-rpcOH2Mdf3i2XGCQhtU4ZuD0Y37WpHKGlKtnQ&s=ihxNa3DwQPrfqEURi_UIncjESJC_XexR_AjY81coG8U&e=>
>>>> (ttl 0)
>>>>
>>>> We have cross node timeouts enabled, but ntp is running on all nodes
>>>> and no node appears to have time drift.
>>>>
>>>>
>>>>
>>>> The network appears to be fine between nodes, with iperf tests showing
>>>> that we have a lot of headroom.
>>>>
>>>>
>>>>
>>>> Any thoughts on what to look for? Can we increase thread count/pool
>>>> sizes for the messaging service?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>   Mike Heffner <mike@librato.com>
>>>>
>>>>   Librato, Inc.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>   Mike Heffner <mike@librato.com>
>>>>
>>>>   Librato, Inc.
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Jens Rantil
>>>> Backend Developer @ Tink
>>>>
>>>> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
>>>> For urgent matters you can reach me at +46-708-84 18 32.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>   Mike Heffner <mike@librato.com>
>>>>
>>>>   Librato, Inc.
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>>
>>>   Mike Heffner <mike@librato.com>
>>>   Librato, Inc.
>>>
>>>
>>
>>
>> --
>>
>>   Mike Heffner <mike@librato.com>
>>   Librato, Inc.
>>
>>
>


-- 

  Mike Heffner <mike@librato.com>
  Librato, Inc.

Mime
View raw message