I have been seeing some strange trends in read latency and wanted to throw them out there in the hope of finding some explanations. We are running Cassandra 0.6.5 on a 10-node cluster with rf=3. The read latency reported by cfstats is always about 1/4 of the actual time it takes to get the data back to the Python client. We are not using any higher-level clients, and we are usually doing quorum reads (rf=3). If we ask for 1 copy (vs. 2 for quorum), the client time is around 2x the time reported in cfstats. This is true whether we have a 0.8 ms read or a 5000 ms read: it is always around 4x the cfstats time for a quorum read and 2x for a single-value read. This tells me that much of the time spent waiting for a read has nothing to do with disk random read latency, which is contrary to what I expected.
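To make the comparison concrete, here is a minimal sketch of how the client-side number can be measured and set against the cfstats figure. `read_fn` is a hypothetical placeholder for whatever single read call the client issues; it is not a real Cassandra API, and the helper names are illustrative only.

```python
import statistics
import time


def time_reads(read_fn, n=100):
    """Call read_fn() n times and return each wall-clock latency in ms."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        read_fn()  # hypothetical: one read at the chosen consistency level
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def overhead_ratio(client_ms, cfstats_ms):
    """How many times larger the client-observed latency is than cfstats."""
    return client_ms / cfstats_ms


if __name__ == "__main__":
    # Stand-in read that does nothing; swap in a real client call to use this.
    samples = time_reads(lambda: None, n=1000)
    print("median client latency (ms):", statistics.median(samples))
    # 0.85 ms is the cfstats figure quoted in this post; 2.0 ms is the
    # client-side time for a single (non-quorum) read.
    print("ratio:", overhead_ratio(2.0, 0.85))
```

Using the median rather than the mean keeps a few slow outliers from skewing the comparison.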

What is that extra time being used for? If the value is retrieved in 1 ms but takes 2 ms to reach the client, that leaves 1 ms unexplained. Is the node contacted by the client doing some "work" that equals the time spent by the node actually serving up the data? Is this the Thrift server packaging up the response to the client?

Are reads really more CPU bound? We have lower-end CPUs in our nodes; is that part of the cause?

What is cfstats actually reporting? Is it not really reporting ALL of the time required to service a read? I assume it is not reporting the time needed to send the result to the requesting node.

How much of this time is network time? Would InfiniBand or a lower-latency network architecture reduce any of these times? If we want to reduce a 2 ms read to a 1 ms read, what will help us get there? We have cached keys, which gives us a cfstats read latency under 1 ms (~0.85 ms), but it still takes 2 ms to get the value to the client (single read).
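One way to frame the network question is as a simple latency budget: subtract the cfstats figure and the estimated network round trips from the client-observed time and see what is left. The sketch below is only arithmetic. The 0.2 ms round-trip time and the hop count of 2 (client to contacted node, contacted node to replica) are assumptions for illustration, not measurements from our cluster.

```python
def unexplained_ms(client_ms, cfstats_ms, rtt_ms, hops=1):
    """Time left after removing the local read and `hops` network round trips."""
    return client_ms - cfstats_ms - hops * rtt_ms


if __name__ == "__main__":
    # 2.0 ms and 0.85 ms are the figures quoted above; rtt_ms is assumed.
    print(unexplained_ms(client_ms=2.0, cfstats_ms=0.85, rtt_ms=0.2, hops=2))
```

If the remainder stays large even with a generous round-trip estimate, a faster network alone would not be expected to close the gap.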

Why does a quorum read double everything? It seems quorum reads are serialized rather than run in parallel. Is that true, and if so, why? Obviously it takes more time to get 2 values and compare them than to get one value, but if that is always 2x or more, then the adjustable consistency of Cassandra comes at a very high price.
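The reasoning behind the "serialized" suspicion can be shown with a toy model. If the replica reads were issued in parallel, the quorum latency should track the slowest of the replicas consulted; if they are issued one after another, it tracks their sum. This is only a model of the two possibilities, not a description of what Cassandra actually does, and the per-replica times are made-up inputs.

```python
def quorum_size(rf):
    """Number of replicas that must respond for a quorum: floor(rf / 2) + 1."""
    return rf // 2 + 1


def serial_latency(replica_ms):
    """Total time if each replica read starts only after the previous one ends."""
    return sum(replica_ms)


def parallel_latency(replica_ms):
    """Total time if all replica reads start together; the slowest one dominates."""
    return max(replica_ms)


if __name__ == "__main__":
    needed = quorum_size(3)           # rf=3, as in our cluster
    replica_ms = [1.0, 1.5][:needed]  # hypothetical per-replica read times
    print("replicas needed:", needed)
    print("serial:", serial_latency(replica_ms))
    print("parallel:", parallel_latency(replica_ms))
```

A consistent 2x between single and quorum reads looks much more like the serial case than the parallel one, which is what prompts the question.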

Any other suggestions for decreasing read latency? Faster disks don't seem as useful as faster CPUs. We have worked hard to reduce the cfstats reported read latency and have been successful. How can we reduce the time from there to the requesting client? What is the anatomy of a read from client request to result? Where does the time usually go and what can help speed each step up? Caching is the obvious answer but assume we are already caching what can be cached (keys).

Thanks in advance for any advice or explanations anyone might have.