From Jouni Hartikainen <>
Subject Write latency spikes
Date Thu, 07 Mar 2013 06:44:49 GMT
Hi all,

I'm experiencing strange latency spikes when writing and am trying to figure out what could cause them.

My setup:
- 3 nodes, writing at CL.ONE using Hector client, no reads
- Writing simultaneously to 3 CFs, inserts with 25h TTL, no deletes, no updates, RF 3
   - 2 CFs have small data (row count < 2000, row size < 500kB, column count/row < 15 000)
   - 1 CF has lots of binary data split into ~60kB columns (row count < 550 000, row sizes < 2MB, column count/row < 40)
   - Write rate ~300 inserts / s for each CF, total write throughput ~25 MB (bytes) / second
   - data is time series using timestamp as column key
- Cassandra 1.2.2 with 256 vnodes on each machine
- Key cache at default 100MB, no row cache
- 1 x Xeon L5430 CPU, 16GB RAM, 2.3T disc on RAID10 (10k SAS), Sun/Oracle JDK 1.6 (tried also 1.7), 4GB JVM heap, JNA enabled
- all nodes in the same DC, 1Gb network, sub ms latencies between nodes

example cfhistograms:
example proxy histograms:

With this setup I usually get quite nice write latencies of less than 20ms, but sometimes
(about once every few minutes) latencies momentarily spike to more than 300ms, maxing out at
~2.5 seconds. Spikes are short (< 1 s) and happen on all nodes (but not at the same time).
Even if avg latencies are very good, these spikes cause us headaches due to our SLA.

While investigating I have learned the following:
- No evident GC pressure (nothing in C* logs, GC logging showing constantly < 30ms collection times)
- Not I/O bound (disks provide ~1GB/s linear write and are mostly idle apart from memtable
flushes every ~11s)
- No relation between spikes & compaction
- No queuing in memtable FlushWriter, no blocked memtable flushes
- Nothing alarming in logs
- No timeouts, no errors on the client side
- All clients (3 separate machines) experience the latency spikes simultaneously, which points to
the cause being in C*, not in the client
- CPU load < 10% (< 20% while compacting)
- Latencies measured both from the client and observed using nodetool cfhistograms
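
For anyone wanting to reproduce the client-side measurement, here is a minimal sketch of how per-write latencies could be timed and spikes timestamped so they can be lined up against server-side logs. This is not the actual client code (the real client is Hector on the JVM); the class name `SpikeRecorder`, its methods, and the 300 ms threshold are hypothetical and chosen only to match the numbers in this post.

```python
import math
import time


class SpikeRecorder:
    """Records write latencies (ms) and keeps the ones above a threshold."""

    def __init__(self, threshold_ms=300.0):
        self.threshold_ms = threshold_ms
        self.samples = []  # every observed latency, in ms
        self.spikes = []   # (wall-clock timestamp, latency_ms) for slow writes

    def record(self, latency_ms, at=None):
        """Store one latency; remember when it happened if it is a spike."""
        self.samples.append(latency_ms)
        if latency_ms > self.threshold_ms:
            when = time.time() if at is None else at
            self.spikes.append((when, latency_ms))

    def timed(self, write_fn, *args, **kwargs):
        """Call write_fn (a stand-in for the real insert) and record its duration."""
        start = time.perf_counter()
        result = write_fn(*args, **kwargs)
        self.record((time.perf_counter() - start) * 1000.0)
        return result

    def percentile(self, p):
        """Nearest-rank percentile of all recorded latencies."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p / 100.0 * len(ordered)))
        return ordered[rank - 1]
```

Wrapping each insert in `recorder.timed(...)` and later comparing the timestamps in `recorder.spikes` across the three client machines is one way to check whether the spikes really coincide.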

Now I'm running out of ideas about what might cause the spikes, as I have understood that there
are really not that many places on the write path that could block.

Any ideas?

