On Thu, Oct 25, 2012 at 4:15 AM, aaron morton <aaron@thelastpickle.com> wrote:
This sounds very much like "my heap is so consumed by (mostly) bloom
filters that I am in steady state GC thrash."

Yes, I think that was at least part of the issue.

The rough numbers I've used to estimate working set are:

* bloom filter size for 400M rows at 0.00074 fp chance, without Java fudge (they are just a big array): 714 MB
* memtable size: 1024 MB
* index sampling:
  * 24 bytes + key (16 bytes for a UUID) = 32 bytes per sample
  * 400M / 128 default sampling = 3,125,000 samples
  * 3,125,000 * 32 = 95 MB
  * Java fudge x5 or x10 = 475 MB to 950 MB
* ignoring row cache and key cache

So the high-side number is 2,213 MB to 2,688 MB. High because the fudge is a delicious sticky guess and the memtable space would rarely be full.
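A minimal sketch of that arithmetic (the sizes and x5/x10 fudge factors are the guesses from the list above, not measured values):

```python
MB = 1024 * 1024

bloom_filter = 714   # MB, for 400M rows at 0.00074 fp chance
memtable = 1024      # MB

# index sampling: 24 bytes overhead + 16-byte UUID key = 32 bytes per sample
samples = 400_000_000 // 128        # default index_interval of 128
index_mb = samples * 32 / MB        # ~95 MB before JVM overhead

low = bloom_filter + memtable + index_mb * 5    # x5 Java fudge
high = bloom_filter + memtable + index_mb * 10  # x10 Java fudge
print(round(low), round(high))
```

This lands a couple of MB above the 2,213/2,688 figures only because the 95 MB was rounded down in the hand calculation.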

On a 5120 MB heap with 800 MB new, you have roughly 4300 MB tenured (some goes to perm), and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload.
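Worked out explicitly (the 75% figure matches the default CMSInitiatingOccupancyFraction of 75 in cassandra-env.sh; shaving ~20 MB for perm to get the rough 4300 is my assumption):

```python
heap = 5120                 # MB total heap
new_gen = 800               # MB young generation
tenured = heap - new_gen    # 4320 MB; call it ~4300 after perm takes a slice

# CMS begins concurrent collection around 75% old-gen occupancy
cms_threshold = 0.75 * 4300
print(tenured, cms_threshold)
```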

These values seem reasonable and in line with what I was seeing. There are other CFs and apps sharing this cluster, but this one was the largest.


You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally; it's probably best to keep repair off.
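For reference, the relevant block in cassandra-env.sh looks roughly like this once uncommented (standard HotSpot flags of that era; the log path is just an example):

```shell
# GC logging options in cassandra-env.sh (uncomment/add as needed)
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```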

I was using jstat to monitor GC activity, and some snippets from that are in my original email in this thread. The key behavior was that full GC was running pretty often and was never able to reclaim much (if any) space.
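If anyone wants to reproduce that view, the jstat invocation is just the following (looking the pid up via pgrep is one way; the process name may differ per install):

```shell
# Sample GC utilisation every 10 seconds (interval is in ms)
jstat -gcutil "$(pgrep -f CassandraDaemon)" 10000
# Watch the O (old gen %) column: pinned near 100 while FGC keeps
# incrementing means full GCs are reclaiming little or nothing.
```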


There are a few things you could try:

* increase the JVM heap by say 1Gb and see how it goes
* increase the bloom filter false positive chance, try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance)
* increase index_interval sampling in yaml.  
* decreasing compaction_throughput and in_memory_compaction_limit can lessen the additional memory pressure compaction adds. 
* disable caches or ensure off heap caches are used.
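Concretely, the yaml-side knobs from that list look like this in cassandra.yaml (values here are illustrative, not recommendations; bloom_filter_fp_chance is a per-CF setting changed through the CLI/schema rather than the yaml):

```yaml
# cassandra.yaml -- illustrative values only
index_interval: 256                   # default 128; larger interval = smaller index sample in heap
compaction_throughput_mb_per_sec: 8   # default 16
in_memory_compaction_limit_in_mb: 32  # default 64; rows larger than this compact on disk
```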

I've done several of these already, in addition to changing the app to reduce the number of rows retained. How does compaction_throughput relate to memory usage? I assumed that was more for IO tuning. I noticed that lowering concurrent_compactors to 4 (from the default of 8) lowered the memory used during compactions. in_memory_compaction_limit_in_mb seems to only apply to wide rows, and this CF didn't have any rows wider than that limit. My multithreaded_compaction is still false.


Watching the gc logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries. 

Finally, this book is a good reference on Java GC: http://amzn.com/0137142528 

For my understanding, what was the average row size for the 400 million keys? 

The compacted row mean size for the CF is 8815 bytes (as reported by cfstats), but that comes out to be much larger than the real load per node I was seeing. Each node had about 200GB of data for the CF, with 4 nodes in the cluster and RF=3. At the time, the TTL for all columns was 3 days and gc_grace_seconds was 5 days. Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0, so the number of rows and amount of data dropped to a level the cluster can handle.
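Rough arithmetic shows how large that gap was (a sketch; it ignores compression, overwrites, and TTL-expired data, which is presumably where the difference went):

```python
rows = 400_000_000
mean_row_bytes = 8815    # from cfstats
rf, nodes = 3, 4

raw_tib = rows * mean_row_bytes / 1024**4            # total row data, TiB
per_node_gib = rows * mean_row_bytes * rf / nodes / 1024**3
print(round(raw_tib, 1), round(per_node_gib))   # vs the ~200 GB per node actually observed
```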