From: Ted Pearson
Subject: Troubleshooting random node latency spikes
Date: Thu, 05 Jan 2017 19:34:09 GMT

I'm setting up a new Cassandra cluster with a write-heavy workload (50% writes), and I've
run into a strange spiky latency problem. My application metrics showed random latency
spikes, which I tracked back to spikes on individual Cassandra nodes. The p99 of
ClientRequest.Latency.Read/Write occasionally jumps to several seconds on one node at a
time, instead of its normal value of around 1000 microseconds. I also noticed that
ReadRepair.RepairedBackground.m1_rate goes from zero to a non-zero value (around 1-2/sec)
on that node during the spike. I'm lost as to why these spikes are happening and hope
someone can give me ideas.

I tried to test whether the ReadRepair metric is causally linked to the latency spikes.
After I set dclocal_read_repair_chance to 0 on my tables, the metrics showed no
ReadRepair.Attempted, yet ReadRepair.RepairedBackground still went up during latency
spikes. Am I misunderstanding what this metric tracks? I don't understand why it went
up if I turned off read repair.
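
The co-occurrence described above (p99 spikes coinciding with a non-zero
ReadRepair.RepairedBackground rate) could be checked offline against exported metric
samples. The following is only a minimal sketch under stated assumptions: the sample
values are made up, the helper names find_spikes and spike_overlap are hypothetical,
and the "10x the median" spike threshold is an arbitrary choice, not something taken
from Cassandra or from this cluster.

```python
from statistics import median


def find_spikes(p99_us, factor=10.0):
    """Return indices of samples whose p99 exceeds factor * the median p99.

    The median is used as the baseline so that a few huge spikes do not
    drag the baseline upward. The factor of 10 is an arbitrary assumption.
    """
    baseline = median(p99_us)
    return [i for i, value in enumerate(p99_us) if value > factor * baseline]


def spike_overlap(p99_us, repaired_rate, factor=10.0):
    """Return the fraction of spike samples with a non-zero repair rate.

    Returns None when no spikes are found. Both lists are assumed to be
    sampled at the same timestamps on the same node.
    """
    spikes = find_spikes(p99_us, factor)
    if not spikes:
        return None
    hits = sum(1 for i in spikes if repaired_rate[i] > 0)
    return hits / len(spikes)


# Made-up samples for one node: p99 latency in microseconds and the
# ReadRepair.RepairedBackground m1_rate at the same sample times.
p99_us = [1000, 1100, 950, 3_000_000, 1050, 980, 2_500_000, 1020]
repaired_rate = [0.0, 0.0, 0.0, 1.5, 0.0, 0.0, 1.2, 0.0]

print(find_spikes(p99_us))
print(spike_overlap(p99_us, repaired_rate))
```

A high overlap only shows that the two metrics move together; it does not say which
one drives the other, or whether both follow from some third cause.
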

I'm currently running 2.2.6 in a dual-datacenter setup. It's patched so that metrics are
recency-biased instead of tracking latency over the entire lifetime of the Java process.
I'm using STCS. There is a large amount of data per node, about 500GB currently. I expect
each row to be less than 10KB. It's currently running on way overpowered hardware: 512GB
of RAM, RAID 0 on NVMe, and 44 cores across 2 sockets. All of my queries (reads and
writes) are LOCAL_ONE and I'm using a replication factor of 3.


