incubator-cassandra-user mailing list archives

From Chris Burroughs <chris.burrou...@gmail.com>
Subject Multi-dc restart impact
Date Thu, 05 Sep 2013 13:14:38 GMT
We have a 2-DC cluster running Cassandra 1.2.9.  The DCs are physically 
separate, on opposite coasts of the US, not just logically distinct.  
The primary use of this cluster is CL.ONE reads out of a single column 
family.  My expectation was that, in such a scenario, restarts would 
have minimal impact in the DC where the restart occurred, and no impact 
in the remote DC.
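
For context, here is a minimal sketch of the kind of CL.ONE read 
involved, written against the DataStax Java driver (the contact point, 
keyspace, table, and key are placeholders, not our real schema, and our 
actual client code differs):

  import com.datastax.driver.core.Cluster;
  import com.datastax.driver.core.ConsistencyLevel;
  import com.datastax.driver.core.ResultSet;
  import com.datastax.driver.core.Session;
  import com.datastax.driver.core.SimpleStatement;

  public class OneReadSketch {
      public static void main(String[] args) {
          // Contact point, keyspace, and table are placeholders.
          Cluster cluster = Cluster.builder()
                                   .addContactPoint("10.0.0.1")
                                   .build();
          Session session = cluster.connect("my_keyspace");

          // CL.ONE: the coordinator waits for a single replica's response.
          // As I understand it, plain ONE (unlike a DC-local consistency
          // level) does not force that replica to be in the local DC.
          SimpleStatement stmt =
                  new SimpleStatement("SELECT * FROM my_table WHERE id = 42");
          stmt.setConsistencyLevel(ConsistencyLevel.ONE);
          ResultSet rs = session.execute(stmt);
          System.out.println(rs.one());

          cluster.close();
      }
  }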

Instead, we are seeing that a restart in one DC has a dramatic impact on 
performance in the other (call them DCs "A" and "B").

Test scenario on a node in DC "A":
  * nodetool disablegossip: no change
  * nodetool drain: no change
  * stop the node: no change
  * start the node again: large increase in latency in both DC A *and* DC B

This is a graph showing the increase in latency 
(org.apache.cassandra.metrics.ClientRequest.Latency.Read.95percentile) 
as measured from DC *B*: http://i.imgur.com/OkIQyXI.png  (Actual 
clients report similar numbers that agree with this server-side 
measurement.)  Latency jumps by over an order of magnitude and out of 
our SLAs.  (I would prefer that restarting not cause a latency spike in 
either DC, but the spike induced in the remote DC is particularly 
concerning.)
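
For anyone who wants to pull the same number, the percentile can be 
read over JMX from a node.  A rough sketch, assuming the default JMX 
port 7199 and the MBean/attribute names I believe the 1.2 metrics 
reporter exposes (the host is a placeholder):

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class ReadLatencyProbe {
      public static void main(String[] args) throws Exception {
          // Host is a placeholder; 7199 is Cassandra's default JMX port.
          JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://10.0.0.1:7199/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          MBeanServerConnection mbeans = connector.getMBeanServerConnection();

          // Coordinator-level read latency histogram (what the graph shows).
          ObjectName readLatency = new ObjectName(
                  "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency");
          Object p95 = mbeans.getAttribute(readLatency, "95thPercentile");
          System.out.println("ClientRequest Read latency p95: " + p95);

          connector.close();
      }
  }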

However, the node that was restarted reports only a minor increase in 
latency: http://i.imgur.com/KnGEJrE.png  This is confusing from several 
different angles:
  * I would not expect any cross-DC reads to be occurring under normal 
conditions.
  * If there were cross-DC reads, they would take 50+ ms instead of the 
< 5 ms normally reported.
  * If the node that was restarted was still somehow involved in reads, 
its own reporting shows it can account for only a small part of the 
latency increase.

Some possibly relevant configuration:
  * GossipingPropertyFileSnitch
  * dynamic_snitch_update_interval_in_ms: 100
  * dynamic_snitch_reset_interval_in_ms: 600000
  * dynamic_snitch_badness_threshold: 0.1 (see the sketch after this list)
  * read_repair_chance=0.01 and dclocal_read_repair_chance=0.1 (the same 
kind of behavior was observed with just read_repair_chance=0.1)
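
To explain why the badness threshold is on that list: my working, and 
possibly wrong, mental model of the dynamic snitch is sketched below.  
This is a simplification in my own words and variable names, not the 
actual org.apache.cassandra.locator code, but it is why I suspect a 
freshly restarted node (with reset latency scores) could cause reads to 
be re-ordered onto remote replicas:

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.Map;

  public class SnitchSketch {
      // Simplified guess at how dynamic_snitch_badness_threshold (0.1 for
      // us) is applied; lower score = "closer"/faster replica.  Not the
      // real Cassandra code.
      static List<String> orderReplicas(List<String> subsnitchOrder,
                                        Map<String, Double> scores,
                                        double badnessThreshold) {
          double preferred = scores.getOrDefault(subsnitchOrder.get(0), 0.0);
          for (String replica : subsnitchOrder) {
              double candidate = scores.getOrDefault(replica, 0.0);
              // If the subsnitch's first choice looks more than
              // badnessThreshold worse than some other replica, fall back
              // to sorting purely by score -- which ignores DC boundaries
              // and can promote remote replicas.
              if (preferred > candidate * (1 + badnessThreshold)) {
                  List<String> byScore = new ArrayList<>(subsnitchOrder);
                  byScore.sort(Comparator.comparingDouble(
                          r -> scores.getOrDefault(r, 0.0)));
                  return byScore;
              }
          }
          return subsnitchOrder; // otherwise keep the DC-local ordering
      }
  }

Even if that model is roughly right, I still don't see how a single 
restart produces a sustained cross-DC latency spike, which is part of 
why I'm asking.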

Has anyone else observed similar behavior and found a way to limit it? 
This seems like something that ought not to happen, but without knowing 
why it is occurring I'm not sure how to stop it.

