I mentioned this today to a couple of folks at Cassandra Summit, and thought I'd solicit some more thoughts here.

Currently, the read stage includes checking the row cache. So if concurrent_reads is N and N reads are already reading from disk, the next read will block until one of those disk reads finishes, even if the row it wants is in the row cache. Would it make sense to isolate disk reads from cache reads? We could either use the read stage only on cache misses, or split it into two stages, CacheRead and DiskRead. Of course, with mmap we'd always have to go to DiskRead, since we wouldn't know whether the data is in memory until we asked the OS.
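
To make the first option concrete, here is a rough sketch of "read stage only on misses". Everything in it is hypothetical and for illustration only: SplitReadStages, diskReadStage, and the String-keyed row cache are made-up names, not Cassandra's actual classes. The cache lookup runs on the calling thread, and only a miss is handed to the bounded disk pool.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Cassandra.
final class SplitReadStages {
    // Stand-in for the row cache (key -> serialized row).
    private final Map<String, String> rowCache = new ConcurrentHashMap<>();
    // Bounded pool playing the role of a "DiskRead" stage.
    private final ExecutorService diskReadStage;
    // Stand-in for the actual disk read path.
    private final Function<String, String> diskReader;

    SplitReadStages(int concurrentDiskReads, Function<String, String> diskReader) {
        this.diskReadStage = Executors.newFixedThreadPool(concurrentDiskReads);
        this.diskReader = diskReader;
    }

    void cache(String key, String row) {
        rowCache.put(key, row);
    }

    CompletableFuture<String> read(String key) {
        // Cache hit: answer on the caller's thread, never touching the bounded stage.
        String cached = rowCache.get(key);
        if (cached != null) {
            return CompletableFuture.completedFuture(cached);
        }
        // Cache miss: only now do we compete for one of the disk read slots.
        return CompletableFuture.supplyAsync(() -> {
            String row = diskReader.apply(key);
            if (row != null) {
                rowCache.put(key, row);
            }
            return row;
        }, diskReadStage);
    }

    void shutdown() {
        diskReadStage.shutdown();
    }
}
```

With this shape, a cached read completes even while every disk slot is busy. It does not cover the mmap case, where a "hit" may still page-fault and so would have to be treated as a disk read.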

My thought is that stages should be based on resources rather than semantics, but that may be wrong. Logically, I don't think it would make sense to bound the read stage in a hypothetical system with no IO; the cap was most likely introduced because of the disk and the resulting IO contention.

As a possible bonus, this change would open the door to other optimizations, like batching row reads from disk when the keys are in the key cache (does this even make sense? I'm not too sure how that would work).
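
One guess at what that batching could look like, under heavy assumptions: pretend the key cache is just a map from key to a byte offset in a single data file, and that the DiskRead stage can see several pending keys at once. KeyCacheBatcher and orderByOffset are hypothetical names, not anything in Cassandra. The idea is simply to pick out the pending keys whose positions are already known and order them by offset so they can be read in one forward pass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assumes a single data file and a key cache of key -> byte offset.
final class KeyCacheBatcher {
    /** Returns the pending keys found in the key cache, ordered by on-disk offset. */
    static List<String> orderByOffset(List<String> pendingKeys, Map<String, Long> keyCache) {
        List<String> hits = new ArrayList<>();
        for (String key : pendingKeys) {
            // Keys missing from the key cache need an index lookup first, so skip them here.
            if (keyCache.containsKey(key)) {
                hits.add(key);
            }
        }
        // Ascending offsets let the reader move forward through the file instead of seeking back and forth.
        hits.sort(Comparator.comparingLong((String k) -> keyCache.get(k)));
        return hits;
    }
}
```

Whether this buys anything in practice would depend on how many reads are queued at once and on how rows are spread across files, which this sketch ignores.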

Let me know what you guys think.