Thats a lot of questions, I'll try to answer some...

Read/Write latency as reported for a CF is the time taken to perform a local read on that node. 

Read/Write latency reported on the o.a.c.service.StorageProxy are the time taken to process a complete request, including local and remote reads when CL is > ONE. It includes the time taken to compare the values read from the different nodes, which may include running the read *again* to request full values from the nodes (rather than digest values). 

With logging at DEBUG level these messages in the log may give you some insight into whats happening (Most of this is guess work on my part, others please correct me if I am wrong)

-  strongread/weakread reading digest / data 
  Tells you a read with CL > ONE is starting, digest or data says if the node will return the full data or a hash
  Lines with weakread are CL ONE reads. 

- quorumResponseHandler:
Tells you the requested CL of nodes have responded to the coordinator

- Digest mismatch:
  Tells you the nodes did not agree and coordinator needs to sort it out. I think this will result in Read Repair and the full data (not digest) being read from each node. 

So the read path will be slowed down if a RR is needed. This can be helped by nodetool repair. It can also be slowed if there are a lot fo SSTables on disk, nodetool compact if your friend there. 

Reads to multiple nodes are done in parallel, not serial. DO not know enough about the network parts to help you out there. 

Hope that helps you dig a little deeper.
Aaron


On 07 Oct, 2010,at 02:55 AM, Wayne <wav100@gmail.com> wrote:

I have been seeing some strange trends in read latency that I wanted to throw out there to find some explanations. We are running .6.5 in a 10 node cluster rf=3. We find that the read latency reported by the cfstats is always about 1/4 of the actual time it takes to get the data back to the python client. We are not using any higher level clients, and we usually are doing Quorum reads (rf=3). If we ask for 1 copy (vs. 2 for Quorum) it is around 2x the time reported in cfstats. This is true whether we have a .8 ms read or a 5000 ms read. It is always around 4x the time for a Quorum read and 2x the time for a single value read. This tells me that much of the time waiting for a read has nothing to do with disk random read latency. This is contrary to what is expected.

What is that extra time being used for? Waiting 2 ms for a read value to the client when the value is retrieved in 1ms leaves 1ms that is unexplainable. Is the node being requested by the client doing some "work" that equal the time spent by the node actually serving up the data? Is this the thrift server packaging up the response to the client?

Are reads really more CPU bound? We have lower end CPUs in our nodes, is that part of the cause?

What is cfstats actually reporting? Is it not really reporting on ALL of the time required to service a read? I assume is not reporting the time to send the result to the requesting node.

How much of this time is network time? Would Infiniband or a lower latency network architecture reduce any of these times? If we want to reduce a 2 ms read to a 1ms read what will help us get there? We have cached keys which then gives us a cfstats read latency < 1ms (~.85) but it still takes 2ms to get to the client (single read).

Why does a quorum read double everything? It seems quorum reads are serialized and not parallel. Is that true and if so why? Obviously it takes more time to get 2 values and compare then get one value but if that is always 2x+ then the adjustable consistency of Cassandra comes at a very high price.

Any other suggestions for decreasing read latency? Faster disks don't seem as useful as faster CPUs. We have worked hard to reduce the cfstats reported read latency and have been successful. How can we reduce the time from there to the requesting client? What is the anatomy of a read from client request to result? Where does the time usually go and what can help speed each step up? Caching is the obvious answer but assume we are already caching what can be cached (keys).

Thanks in advance for any advice or explanations anyone might have.