We're running a small Cassandra cluster (v1.0.10) serving data to our web application, and as our traffic grows, we're starting to see some weird issues. The biggest of these is that sometimes a single node becomes unresponsive: either it won't accept new connections, or it won't accept requests, or it accepts a request and never returns anything. Our client library is set to retry on another server when this happens, and when it does, the request is usually served instantly. So it's not that some requests are inherently expensive; it's that sometimes a node is just "busy", and we have no idea why or what it's doing.
We're using MRTG and Monit to monitor the servers, and in MRTG the average CPU usage is around 5% on our quad-core Xeon servers with SSDs. But occasionally we can see through Monit that the 1-min load average goes above 4, and this usually corresponds to the issues above. Is this common? Does this happen to everyone else? And why is the load so spiky? We can't find anything in the Cassandra logs indicating that something's up (such as a slow GC or compaction), and there's no corresponding traffic spike in the application either. Should we just add more nodes if any single one gets CPU spikes?
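To line the load spikes up with the client-side failures more precisely than MRTG's averages allow, we are thinking of something like the rough sketch below. It samples the 1-min load average at a short interval and flags readings above the core count. The function names `is_saturated` and `sample_load` are made up for illustration, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import time


def is_saturated(load_1min, cores):
    """Return True when the 1-min load average exceeds the number of cores."""
    return load_1min > cores


def sample_load(samples=5, interval_s=1.0):
    """Collect (timestamp, 1-min load, saturated?) tuples.

    The timestamps can later be matched against the times at which the
    client library had to retry a request on another node.
    """
    cores = os.cpu_count() or 1
    readings = []
    for _ in range(samples):
        load_1min = os.getloadavg()[0]  # Unix only
        readings.append((time.time(), load_1min, is_saturated(load_1min, cores)))
        time.sleep(interval_s)
    return readings


if __name__ == "__main__":
    for ts, load, saturated in sample_load():
        print(f"{ts:.0f} load={load:.2f} saturated={saturated}")
```

Note that on Linux the load average also counts processes in uninterruptible (typically I/O) wait, so a load above 4 does not by itself mean all four cores are busy.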
Another explanation could be that we've configured it wrong. We're running pretty much the default config. Each node has 16 GB of RAM, a 4 GB heap, no row cache and a sizeable key cache. Despite that, we occasionally get OOM exceptions and nodes crash, maybe a few times per month. Should we increase the heap size? Or move to 1.1 and enable off-heap caching?
We have quite a lot of traffic to the cluster. There is a single keyspace with two column families, RF=3, compression is enabled, and we're running the default size-tiered compaction.
Column family A has 60GB of actual data, each row has a single column, and that column holds binary data that varies in size up to 500kB. When we update a row, we write a new value to this single column, effectively replacing that entire row. We do ~1000 updates/s, totalling ~10MB/s in writes.
Column family B also has 60GB of actual data, but each row has a variable number (~100-10000) of supercolumns, and each supercolumn has a fixed number of columns with a fixed amount of data, totalling ~1kB. On this column family we add supercolumns to rows at a rate of ~200/s, and occasionally we bulk-delete supercolumns in a row.
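To make the write load concrete, here is a back-of-the-envelope calculation using only the figures above. It assumes decimal units (1 kB = 1000 bytes) and ignores compression, column overhead and the commit log, so the real on-disk numbers will differ. The variable names are just for illustration.

```python
KB = 1000
MB = 1000 * KB
RF = 3  # replication factor

# Column family A: ~1000 updates/s totalling ~10 MB/s
a_updates_per_s = 1000
a_bytes_per_s = 10 * MB
a_avg_value_bytes = a_bytes_per_s / a_updates_per_s

# Column family B: ~200 supercolumns/s of ~1 kB each
b_supercolumns_per_s = 200
b_supercolumn_bytes = 1 * KB
b_bytes_per_s = b_supercolumns_per_s * b_supercolumn_bytes

# Row size range in B: ~100 to ~10000 supercolumns per row
b_row_min_bytes = 100 * b_supercolumn_bytes
b_row_max_bytes = 10_000 * b_supercolumn_bytes

# Total written by clients, and total written across all replicas
client_bytes_per_s = a_bytes_per_s + b_bytes_per_s
replicated_bytes_per_s = client_bytes_per_s * RF

print(f"A: average value size  ~{a_avg_value_bytes / KB:.0f} kB")
print(f"B: write rate          ~{b_bytes_per_s / KB:.0f} kB/s")
print(f"B: row size            ~{b_row_min_bytes / KB:.0f} kB to ~{b_row_max_bytes / MB:.0f} MB")
print(f"Client writes          ~{client_bytes_per_s / MB:.1f} MB/s")
print(f"Cluster-wide with RF=3 ~{replicated_bytes_per_s / MB:.1f} MB/s")
```

So column family A dominates the write volume by a wide margin (the average value is ~10 kB, well below the 500kB maximum), while column family B contributes little bandwidth but has rows that can grow to around 10 MB.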
The config options we are unsure about are things like commit log sizes, memtable flushing thresholds, commit log syncing, compaction throughput, etc. Are we at a point with our data size and write load that the defaults aren't good enough anymore? Should we stick with size-tiered compaction, even though our application is update-heavy?