In case anyone is following this, here is an update:
I was able to narrow it down to the Cassandra-to-Cassandra link. Storage proxy latency depends on the amount of data per key: the more data is transferred for a key, the higher the latency. No surprise here.
The client connects to daemon "A" and sends a key-value pair. "A" accepts the Thrift message, deserializes it into an object, sees that the key belongs to daemon "B", serializes it to bytes once again (in the internal format this time), and invokes MessagingService, which in turn writes to a socket. As soon as "B" delivers the write acknowledgment over a different connection, the client call returns.
Cassandra's MessagingService uses Java NIO to connect to other Cassandra daemons; all connections are unidirectional. So in theory it should be very fast. But it's not.
What does look suspicious is an apparent cap on network usage: only ~4% of the 1 Gbps link is used regardless of the "value" size. With smaller values I get better throughput; with larger ones (200 KB), worse.
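To put the ~4% figure in perspective, here is a rough back-of-the-envelope calculation in Java. It is only a sketch: the class and method names are made up for illustration, and it assumes the value size quoted above means 200 KiB (204,800 bytes) and that the 4% is measured against the raw 1 Gbps line rate.

```java
// Back-of-the-envelope sketch: what a ~4% cap on a 1 Gbps link allows.
// LinkBudget and its methods are hypothetical helpers, not Cassandra code.
class LinkBudget {
    /** Bytes per second available at the given utilization (0..1) of a link. */
    static double bytesPerSecond(double linkBitsPerSecond, double utilization) {
        return linkBitsPerSecond * utilization / 8.0;
    }

    /** Milliseconds needed to push one value of the given size at that rate. */
    static double millisPerValue(long valueBytes, double bytesPerSecond) {
        return valueBytes / bytesPerSecond * 1000.0;
    }

    public static void main(String[] args) {
        double rate = bytesPerSecond(1_000_000_000.0, 0.04); // about 5,000,000 bytes/s
        long valueBytes = 200L * 1024;                       // assumption: 200 KiB per value
        System.out.printf("usable rate: %.0f bytes/s%n", rate);
        System.out.printf("time per value: %.2f ms%n", millisPerValue(valueBytes, rate));
        System.out.printf("values per second: %.1f%n", rate / valueBytes);
    }
}
```

Under those assumptions the cap works out to roughly 5 MB/s, or about 41 ms of wire time per 200 KiB value.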
As a temporary workaround, the client could be made responsible for identifying which Cassandra instance it should send a key to. With a 200 KB value, that is ~10 times faster.
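For anyone who wants to try the workaround, here is a minimal sketch of client-side routing over a token ring. Everything in it is an assumption for illustration: TokenRingRouter is a hypothetical class, not part of any Cassandra client API; the tokens and host names must be supplied by the caller; and the MD5-based token function is only a guess at what a random partitioner does, so it has to be replaced with whatever the cluster's partitioner actually computes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical client-side router: picks the node that owns a key so the
// client can connect to it directly and skip the node-to-node hop.
class TokenRingRouter {
    // Ring position (token) -> host owning the range that ends at that token.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(BigInteger token, String host) {
        ring.put(token, host);
    }

    // Assumption: token = non-negative MD5 of the UTF-8 key bytes.
    // This must match the partitioner the cluster is really using.
    static BigInteger tokenFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(1, md5.digest(key.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // First node whose token is >= the key's token; wraps to the lowest token.
    String hostFor(BigInteger keyToken) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes registered");
        }
        Map.Entry<BigInteger, String> owner = ring.ceilingEntry(keyToken);
        return owner != null ? owner.getValue() : ring.firstEntry().getValue();
    }

    String hostFor(String key) {
        return hostFor(tokenFor(key));
    }
}
```

The client would register each node's token once with addNode, then call hostFor(key) before every insert and open its Thrift connection to that host. The ring has to be kept in sync with the cluster by hand, which is why I only consider this a stopgap.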
I have the following puzzle:
Storage proxy write latency ~235ms
CF write latency <1 ms
I have 3 nodes in the cluster, running Cassandra 0.4. Tokens are evenly distributed.
The client connects to a node and inserts a key with ConsistencyLevel.ONE.
If it happens to be a local write, the operation is fast, the same speed as in a one-node setup. JMX shows write latency <1 ms.
If it happens to be a remote insert, StorageProxy sends it to the proper node. This operation is slow. JMX shows write latency ~235 ms.
At the same time, JMX on the remote node shows the same <1 ms write latency. So it's not the remote node being sluggish, it's something else.
There are no pending tasks on the remote node (the JMX counters are always zero), and the network is 1 Gbps and idle. So I can't blame either.
I profiled the Cassandra server in JProfiler and could not find anything. All the extra time is spent inside QuorumResponseHandler, waiting for the condition to be signaled, which should happen as soon as the response is received.
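To make clear what kind of wait I mean, here is a simplified sketch of the pattern: the caller blocks on a condition until another thread delivers the response. This is not Cassandra's QuorumResponseHandler; TimedResponseHandler and its methods are hypothetical, and the timing is only there to show how one could measure the blocked time.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a response handler that waits on a condition.
class TimedResponseHandler {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition responded = lock.newCondition();
    private boolean done; // guarded by lock

    // Called by the thread that receives the acknowledgment.
    void onResponse() {
        lock.lock();
        try {
            done = true;
            responded.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Blocks until onResponse() has been called or the timeout expires.
    // Returns the nanoseconds spent in this call, or -1 on timeout.
    long awaitResponse(long timeout, TimeUnit unit) throws InterruptedException {
        long start = System.nanoTime();
        long remaining = unit.toNanos(timeout);
        lock.lock();
        try {
            while (!done) { // loop guards against spurious wakeups
                if (remaining <= 0) {
                    return -1;
                }
                remaining = responded.awaitNanos(remaining);
            }
            return System.nanoTime() - start;
        } finally {
            lock.unlock();
        }
    }
}
```

In a pattern like this the waiting thread does no work of its own, so a long wait only says that the signal arrived late, not why.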
There is one pooled TCP connection open to the remote host. Hardly a bottleneck; the ThreadPoolExecutors look OK.
Any ideas why the write latency is so high?