Hey everyone,

We're stress testing writes for a few counter CFs and noticed that on one node the ReplicateOnWriteStage thread pool backed up to the point where it started blocking tasks. The cluster is six nodes, RF=3, running 1.2.9. All CFs use LCS with 160 MB sstables. All writes were at CL.ONE.

A few questions:
  1. What causes a RoW (replicate on write) task to be blocked? The queue maxes out at 4128, which seems to be 32 * (128 + 1). 32 is our concurrent_writes setting.

  2. Given this is a counter CF, can those dropped RoWs be repaired with a "nodetool repair"? From my understanding of how counter writes work, until we run that repair, if we're not reading at CL.ALL or with read_repair_chance = 1, we will get some incorrect reads, but a repair will fix things. Is that right?

  3. The CPU on the node where the number of blocked tasks started increasing was pegged, but I/O was not saturated. There were compactions running on those column families as well. Is there a setting we could consider altering that might prevent that backup, or is the answer likely "increase the number of nodes to get more throughput"?
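
For reference, here is the arithmetic behind the 4128 figure in question 1 as a quick sketch. The function and variable names are made up for illustration, and the 32 * (128 + 1) decomposition is only my guess at where the limit comes from; I have not confirmed it against the Cassandra source.

```python
# Hypothetical check of the guessed queue limit; names are illustrative only.
OBSERVED_QUEUE_MAX = 4128  # pending-task ceiling we saw on the backed-up node


def guessed_queue_limit(threads, per_thread_slots=128 + 1):
    """Guess: each thread accounts for 128 queued tasks plus 1 in flight."""
    return threads * per_thread_slots


if __name__ == "__main__":
    guess = guessed_queue_limit(32)  # 32 = our concurrent_writes value
    print(guess, guess == OBSERVED_QUEUE_MAX)
```

If the guess is right, the ceiling should scale with the thread count, which is something we could verify by changing that setting and rerunning the test.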

Thanks in advance for any insights!

Andrew