On 5 August 2013 20:04, Christopher Wirt <email@example.com> wrote:
Question about counters, replication and the ReplicateOnWriteStage
I’ve recently turned on a new CF which uses a counter column.
We have a three DC setup running Cassandra 1.2.4 with vNodes, hex core processors, 32Gb memory.
DC 1 - 9 nodes with RF 3
DC 2 - 3 nodes with RF 2
DC 3 - 3 nodes with RF 2
DC 1 one receives most of the updates to this counter column. ~3k per sec.
I’ve disabled any client reads while I sort out this issue.
Disk utilization is very low
Memory is aplenty (while not reading)
CREATE TABLE cf1 (
PRIMARY KEY (uid, id1, id2, id3)
) WITH …
Three of the machines in DC 1 are reporting very high CPU load.
Looking at tpstats there is a large number of pending ReplicateOnWriteStage just on those machines.
Why would only three of the machines be reporting this?
Assuming its distributed by uuid value there should be an even load across the cluster, yea?
Am I missing something about how distributed counters work?If you have many different uid values and your cluster is balanced then you should see even load. Were your tokens chosen randomly? Did you start out with num_tokens set high or upgrade from num_tokens=1 or an earlier Cassandra version? Is it possible your workload is incrementing the counter for one particular uid much more than the others?The distribution of counters works the same as for non-counters in terms of which nodes receive which values. However, there is a read on the coordinator (randomly chosen for each inc) to read the current value and replicate it to the remaining replicas. This makes counter increments much more expensive than normal inserts, even if all your counters fit in cache. This is done in the ReplicateOnWriteStage, which is why you are seeing that queue build up.
Is changing CL to ONE fine if I’m not too worried about 100% consistency?Yes, but to make the biggest difference you will need to turn off replicate_on_write (alter table cf1 with replicate_on_write = false;) but this *guarantees* your counts aren't replicated, even if all replicas are up. It avoids doing the read, so makes a huge difference to performance, but means that if a node is unavailable later on, you *will* read inconsistent counts. (Or, worse, if a node fails, you will lose counts forever.) This is in contrast to CL.ONE inserts for normal values when inserts are still attempted on all replicas, but only one is required to succeed.
So you might be able to get a temporary performance boost by changing replicate_on_write if your counter values aren't important. But this won't solve the root of the problem.Richard.