cassandra-user mailing list archives

From Mike Heffner <m...@librato.com>
Subject Re: Ring connection timeouts with 2.2.6
Date Tue, 05 Jul 2016 16:14:41 GMT
Jeff,

Thanks, yeah we updated to the 2.16.4 driver version from source. I don't
believe we've hit the bugs mentioned in earlier driver versions.
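
For anyone else on the list who wants to check the same thing, here is a minimal sketch of comparing the installed driver version against a target. It assumes the enhanced networking driver is `ixgbevf` (the thread does not name it), that `ethtool -i <iface>` prints `key: value` lines, and it uses made-up sample output rather than output from our nodes. `parse_ethtool_info`, `version_tuple` and `driver_at_least` are hypothetical helper names.

```python
import re


def parse_ethtool_info(text):
    """Parse `ethtool -i <iface>` style output into a dict of key -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI addresses survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn '2.16.4' into (2, 16, 4) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def driver_at_least(info, driver, minimum):
    """True if the reported driver matches and its version is >= minimum."""
    if info.get("driver") != driver:
        return False
    return version_tuple(info.get("version", "0")) >= version_tuple(minimum)


# Illustrative sample only; not captured from a real host.
SAMPLE = """driver: ixgbevf
version: 2.16.4
firmware-version: N/A
bus-info: 0000:00:03.0
"""

info = parse_ethtool_info(SAMPLE)
print(driver_at_least(info, "ixgbevf", "2.16.4"))
```

In practice the sample text would be replaced by the real output of `ethtool -i eth0` (interface name assumed) on each node.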

Mike

On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.jirsa@crowdstrike.com>
wrote:

> The AWS Ubuntu 14.04 AMI ships with a buggy enhanced networking driver;
> depending on your instance types / hypervisor choice, you may want to
> ensure you’re not seeing that bug.
>
>
>
> *From: *Mike Heffner <mike@librato.com>
> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Date: *Friday, July 1, 2016 at 1:10 PM
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Cc: *Peter Norton <pcn@librato.com>
> *Subject: *Re: Ring connection timeouts with 2.2.6
>
>
>
> Jens,
>
>
>
> We haven't noticed any particularly large GC operations or even
> persistently high GC times.
>
>
>
> Mike
>
>
>
> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.rantil@tink.se> wrote:
>
> Hi,
>
> Could it be garbage collection occurring on nodes that are more heavily
> loaded?
>
> Cheers,
> Jens
>
>
>
> On Sun, Jun 26, 2016 at 05:22, Mike Heffner <mike@librato.com> wrote:
>
> One thing to add: if we do a rolling restart of the ring, the timeouts
> disappear entirely for several hours and performance returns to normal.
> It's as if something is leaking over time, but we haven't seen any
> noticeable change in heap.
>
>
>
> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <mike@librato.com> wrote:
>
> Hi,
>
>
>
> We have a 12-node 2.2.6 ring running in AWS, single DC with RF=3, that is
> sitting at <25% CPU, doing mostly writes, and not showing any particularly
> long GC times/pauses. By all observed metrics the ring is healthy and
> performing well.
>
>
>
> However, we are noticing a fairly consistent number of connection timeouts
> coming from the messaging service between various pairs of nodes in the
> ring. The "Connection.TotalTimeouts" meter metric shows 100k's of timeouts
> per minute, usually between two pairs of nodes. This typically persists
> for several hours at a time, then may stop or move to other pairs of nodes
> in the ring. The metric "Connection.SmallMessageDroppedTasks.<ip>" will
> also grow for one pair of the nodes in the TotalTimeouts metric.
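
To make the "timeouts per minute" figure concrete, here is a rough sketch of turning periodic samples of a cumulative timeout count into per-minute rates and flagging the node pairs that exceed a threshold. It assumes the meter's cumulative count is sampled per node pair by some external collector. The sample numbers, the pair labels and the helper names `per_minute_rates` and `flag_pairs` are all hypothetical.

```python
def per_minute_rates(samples):
    """samples: list of (unix_seconds, cumulative_count), oldest first.

    Returns a list of (unix_seconds, rate_per_minute), one per interval.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = c1 - c0
        if delta < 0:
            # Counter went backwards, e.g. after a node restart:
            # treat the new value as the count since the reset.
            delta = c1
        rates.append((t1, delta * 60.0 / elapsed))
    return rates


def flag_pairs(rates_by_pair, threshold_per_minute):
    """Return the sorted node pairs whose rate ever reaches the threshold."""
    return sorted(
        pair
        for pair, rates in rates_by_pair.items()
        if any(rate >= threshold_per_minute for _, rate in rates)
    )


# Hypothetical samples taken one minute apart.
samples_by_pair = {
    ("node-a", "node-b"): [(0, 0), (60, 150000), (120, 310000)],
    ("node-a", "node-c"): [(0, 0), (60, 12), (120, 20)],
}
rates_by_pair = {p: per_minute_rates(s) for p, s in samples_by_pair.items()}
print(flag_pairs(rates_by_pair, 100000))
```

Tracking which pairs are flagged over time would show whether the problem really moves between pairs, as described above.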
>
>
>
> Looking at the debug log typically shows a large number of messages like
> the following on one of the nodes:
>
>
>
> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
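
A small sketch of tallying these messages per endpoint, which could show whether the skipped hints line up with the node pairs seen in the timeout metrics. It assumes each message sits on a single line of the debug log in the form quoted above. The second IP address and the unrelated log line in the sample are invented for illustration, and `count_skipped_hints` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the endpoint in "Skipped writing hint for /<ipv4>".
SKIPPED_HINT = re.compile(r"Skipped writing hint for /(\d{1,3}(?:\.\d{1,3}){3})")


def count_skipped_hints(lines):
    """Count 'Skipped writing hint' messages per target endpoint."""
    counts = Counter()
    for line in lines:
        match = SKIPPED_HINT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative lines; only the first address appears in the real log.
sample_lines = [
    "StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)",
    "StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)",
    "StorageProxy.java:1033 - Skipped writing hint for /172.26.33.178 (ttl 0)",
    "SomeOther.java:1 - unrelated message",
]
counts = count_skipped_hints(sample_lines)
for endpoint, n in counts.most_common():
    print(endpoint, n)
```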
>
> We have cross node timeouts enabled, but ntp is running on all nodes and
> no node appears to have time drift.
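
The reason clock drift matters here, as I understand it: with `cross_node_timeout` enabled, the receiving node measures a message's age from the sender's timestamp rather than from its own arrival time, so any clock offset between the two nodes counts against the timeout. The sketch below is a simplified model of that idea, not Cassandra's actual code. `is_expired` and all of the numbers are hypothetical, apart from the 2000 ms figure, which I believe is the default `write_request_timeout_in_ms`.

```python
def is_expired(sent_at_ms, received_at_ms, now_ms, timeout_ms, cross_node_timeout):
    """Simplified model of the message-expiry check on the receiving node.

    sent_at_ms is on the sender's clock; received_at_ms and now_ms are on
    the receiver's clock. With cross_node_timeout the age is measured from
    the sender's timestamp, otherwise from local arrival time.
    """
    start = sent_at_ms if cross_node_timeout else received_at_ms
    return now_ms - start > timeout_ms


# Hypothetical scenario: the receiver's clock runs 1500 ms ahead of the sender's.
skew_ms = 1500
sent_at = 10_000                     # sender's clock
received_at = sent_at + 5 + skew_ms  # 5 ms on the wire, receiver's clock
now = received_at + 600              # message waited 600 ms in a queue
timeout = 2000

print(is_expired(sent_at, received_at, now, timeout, cross_node_timeout=True))
print(is_expired(sent_at, received_at, now, timeout, cross_node_timeout=False))
```

In this model the message has really only existed for about 605 ms, yet with cross-node timeouts it looks 2105 ms old and is dropped. That is why I checked for drift first.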
>
>
>
> The network appears to be fine between nodes, with iperf tests showing
> that we have a lot of headroom.
>
>
>
> Any thoughts on what to look for? Can we increase thread count/pool sizes
> for the messaging service?
>
>
>
> Thanks,
>
>
>
> Mike
>
>
>
> --
>
>
>   Mike Heffner <mike@librato.com>
>
>   Librato, Inc.
>
> --
>
> Jens Rantil
> Backend Developer @ Tink
>
> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
> For urgent matters you can reach me at +46-708-84 18 32.
>



-- 

  Mike Heffner <mike@librato.com>
  Librato, Inc.
