If you had one big cache, wouldn't it be the case that it's mostly populated with frequently accessed rows, and less populated with rarely accessed rows?


In fact, wouldn't one big cache dynamically and automatically give you exactly what you want? If you try to partition the same amount of memory manually, by guesswork, among many tables, aren't you always going to do a worse job?

Suppose you have one CF that's used constantly through interaction by users. Suppose you have another CF that's used only periodically by a batch process; you tend to access most or all of its rows during each run, and it's too large for all of its rows to fit in the cache. Normally, you would dedicate cache space to the first CF, since anything with human interaction tends to have good temporal locality and you want to keep latencies there low. Caching the second CF, on the other hand, provides little to no real benefit. If you combine these two CFs, then every time your batch process runs, rows from the second CF will fill the cache and evict rows from the first CF, even though having the batch rows in the cache provides little benefit to you.
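
To make the eviction effect concrete, here is a rough sketch of the scenario above using a toy LRU cache. This is not Cassandra's actual cache implementation; the LRUCache class, the key names, and the sizes (100 interactive rows, 1000 batch rows, room for 100 cached rows) are all made up for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache that only tracks which keys are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.rows:
            # Hit: mark the row as most recently used.
            self.rows.move_to_end(key)
            self.hits += 1
        else:
            # Miss: load the row, evicting the least recently used one if full.
            self.misses += 1
            self.rows[key] = True
            if len(self.rows) > self.capacity:
                self.rows.popitem(last=False)


def interactive_hit_rate(shared_cache):
    """Hit rate for interactive reads that happen right after a batch run.

    shared_cache=True:  batch reads go through the same cache.
    shared_cache=False: batch reads bypass it (caching disabled for that CF).
    """
    cache = LRUCache(capacity=100)
    interactive_keys = [("interactive", i) for i in range(100)]
    batch_keys = [("batch", i) for i in range(1000)]

    # Warm the cache with the rows users keep touching.
    for key in interactive_keys:
        cache.get(key)

    # The batch process scans its whole CF once.
    if shared_cache:
        for key in batch_keys:
            cache.get(key)

    # Measure only the interactive reads that follow the batch run.
    cache.hits = cache.misses = 0
    for key in interactive_keys:
        cache.get(key)
    return cache.hits / (cache.hits + cache.misses)


if __name__ == "__main__":
    print("shared cache:  ", interactive_hit_rate(True))
    print("separate cache:", interactive_hit_rate(False))
```

In this toy model the scan pushes every interactive row out of the shared cache, so the interactive reads that follow all miss; when the batch CF is kept out of the cache, they all hit. Real workloads won't be this clean cut, but the direction of the effect is the same.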

As another example, if you mix a CF with wide rows and a CF with small rows, you no longer have the option of using a row cache, even if it makes great sense for the small-row CF data. The row cache holds entire rows, so the wide rows would make it too expensive.

Knowing your data and access patterns gives you a big advantage when it comes to caching your data effectively.

Tyler Hobbs
Software Engineer, DataStax
Maintainer of the pycassa Cassandra Python client library