incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wei Zhu <wz1...@yahoo.com>
Subject Re: Write latency spikes
Date Thu, 07 Mar 2013 19:13:11 GMT
If you are tight about your SLA, try set socketTimeout from Hector with small number so that
it can retry faster given the assumption that your write is idempotent.

Regarding your write latency, don't have much insight. We see spike on the reads due to GC/compaction
etc. But not write latency.

Thanks.
-Wei

----- Original Message -----
From: "Jouni Hartikainen" <jouni.hartikainen@reaktor.fi>
To: user@cassandra.apache.org
Sent: Wednesday, March 6, 2013 10:44:49 PM
Subject: Write latency spikes

Hi all,

I'm experiencing strange latency spikes when writing and trying to figure out what could cause
them.

My setup:
- 3 nodes, writing at CL.ONE using Hector client, no reads
- Writing simultaneously to 3 CFs, inserts with 25h TTL, no deletes, no updates, RF 3
   - 2 CFs have small data (row count < 2000, row size < 500kB, column count/row <
15 000)
   - 1 CF has lots of binary data split into ~60kB columns (row count < 550 000, row sizes
< 2MB, column count/row < 40)
   - Write rate ~300 inserts / s for each CF, total write throughput ~25 MB (bytes) / second
   - data is time series using timestamp as column key
- Cassandra 1.2.2 with 256 vnodes on each machine
- Key cache at default 100MB, no row cache
- 1 x Xeon L5430 CPU, 16GB RAM, 2.3T disc on RAID10 (10k SAS), Sun/Oracle JDK 1.6 (tried also
1.7), 4GB JVM heap, JNA enabled
- all nodes in the same DC, 1Gb network, sub ms latencies between nodes

cassandra.yaml: http://pastebin.com/MSr2prpb
cfstats: http://pastebin.com/Ax5vPUcY
example cfhistograms: http://pastebin.com/qYSL1MX3
example proxy histograms: http://pastebin.com/X3AGGEjh

With this setup I usually get quite nice write latencies of less than 20ms, but sometimes
(~once in a every few minutes) latencies momentarily spike to more than 300ms maxing out at
~2.5 seconds. Spikes are short (< 1 s) and happen on all nodes (but not at the same time).
Even if avg latencies are very good, these spikes cause us headaches due to our SLA.

While investigating I have learned the following:
- No evident GC pressure (nothing in C* logs, GC logging showing constantly < 30ms collection
pauses)
- No I/O bounds (disks provide ~1GB/s linear write and are mostly idle apart from memtable
flushes for every ~11s)
- No relation between spikes & compaction
- No queuing in memtable FlushWriter, no blocked memtable flushes
- Nothing alarming in logs
- No timeouts, no errors on the client side
- Each client (3 separate machines) experience latencies simultaneously which points to cause
being in C*, not in the client
- CPU load < 10% (< 20% while compacting)
- Latencies measured both from the client and observed using nodetool cfhistograms

Now I'm running out of ideas about what might cause the spikes as I have understood that there
is really not that many places on the write path that could block.

Any ideas?

-Jouni

Mime
View raw message