incubator-cassandra-user mailing list archives

From "Laing, Michael" <michael.la...@nytimes.com>
Subject Re: High latency on 5 node Cassandra Cluster
Date Wed, 04 Jun 2014 10:51:58 GMT
I would first check to see if there was a time synchronization issue among
nodes that triggered and/or perpetuated the event.
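
As a rough illustration of that check, here is a minimal sketch that flags nodes whose clock offset deviates from the cluster median. It assumes the per-node offsets (in milliseconds, relative to a common reference) have already been collected by some other means, for example from each host's NTP client. The node names, offset values, and 500 ms tolerance are hypothetical and do not come from the cluster discussed in this thread.

```python
from statistics import median


def max_clock_skew(offsets_ms):
    """Return the spread between the fastest and slowest clock, in ms."""
    values = list(offsets_ms.values())
    return max(values) - min(values)


def nodes_out_of_sync(offsets_ms, tolerance_ms=500.0):
    """Return the sorted names of nodes whose offset differs from the
    cluster median by more than tolerance_ms (an assumed threshold)."""
    mid = median(offsets_ms.values())
    return sorted(
        node
        for node, offset in offsets_ms.items()
        if abs(offset - mid) > tolerance_ms
    )


# Hypothetical offsets for a 5 node cluster, in milliseconds.
example_offsets = {
    "node-a": 0.0,
    "node-b": 10.0,
    "node-c": -5.0,
    "node-d": 3.0,
    "node-e": 2500.0,
}

suspects = nodes_out_of_sync(example_offsets)
skew = max_clock_skew(example_offsets)
```

Cassandra resolves conflicting writes by timestamp, so a node whose clock has drifted well away from the others is worth ruling out early.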

ml


On Wed, Jun 4, 2014 at 3:12 AM, Arup Chakrabarti <arup@pagerduty.com> wrote:

> Hello. We had some major latency problems yesterday with our 5 node
> Cassandra cluster. Wanted to get some feedback on where we could start to
> look to figure out what was causing the issue. If there is more info I
> should provide, please let me know.
>
> Here are the basics of the cluster:
> Clients: Hector and Cassie
> Size: 5 nodes (2 in AWS US-West-1, 2 in AWS US-West-2, 1 in Linode Fremont)
> Replication Factor: 5
> Quorum Reads and Writes enabled
> Read Repair set to true
> Cassandra Version: 1.0.12
>
> We started experiencing catastrophic latency from our app servers. We
> believed at the time this was due to compactions running, and the clients
> were not re-routing appropriately, so we disabled thrift on a single node
> that had high load. This did not resolve the issue. After that, we stopped
> gossip on the same node that had high load on it; again, this did not
> resolve anything. We then took down gossip on another node (leaving 3/5 up)
> and that fixed the latency from the application side. For a period of ~4
> hours, every time we would try to bring up a fourth node, the app would see
> the latency again. We then rotated the three nodes that were up to make
> sure it was not a networking event related to a single region/provider and
> we kept seeing the same problem: 3 nodes showed no latency problem, 4 or 5
> nodes would. After the ~4 hours, we brought the cluster up to 5 nodes and
> everything was fine.
>
> We currently have some ideas on what caused this behavior, but has anyone
> else seen this type of problem where a full cluster causes problems, but
> removing nodes fixes it? Any input on what to look for in our logs to
> understand the issue?
>
> Thanks
>
> Arup
>
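
For reference, the quorum arithmetic behind the setup quoted above (replication factor 5, quorum reads and writes) can be sketched as follows. This is only the standard majority formula, not a diagnosis of the incident; the function names are illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM operation:
    a strict majority of the replicas."""
    return replication_factor // 2 + 1


def tolerable_replicas_down(replication_factor: int) -> int:
    """How many replicas can be unavailable while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


rf = 5  # replication factor from the quoted message
required = quorum(rf)                  # replicas that must answer
spare = tolerable_replicas_down(rf)    # replicas that may be down
```

With a replication factor of 5 on 5 nodes, every node holds a replica of every row, so a quorum operation needs 3 responses. With only 3 nodes up, all 3 must answer every quorum request; with 4 or 5 up, any 3 of the live replicas suffice.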
