Regarding memory usage after a repair ... Are the merkle trees kept around?

On Oct 23, 2012 3:00 PM, "Bryan Talbot" <btalbot@aeriagames.com> wrote:
On Mon, Oct 22, 2012 at 6:05 PM, aaron morton <aaron@thelastpickle.com> wrote:
The GC was ongoing even when the nodes were not compacting or running a heavy application load -- even when the main app was paused, the constant GC continued.
If you restart a node is the onset of GC activity correlated to some event?

Yes and no.  When the nodes were generally under the .75 occupancy threshold a weekly "repair -pr" job would cause them to go over the threshold and then stay there even after the repair had completed and there were no ongoing compactions.  It acts as though at least some substantial amount of memory used during repair was never dereferenced once the repair was complete.

Once one CF in particular grew larger the constant GC would start up pretty soon (less than 90 minutes) after a node restart even without a repair.


 
 
As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped.  So it's definitely related to data load.  memtable size is 1 GB, row cache is disabled and key cache is small (default).
How many keys did the CF have per node? 
I dismissed the memory used to hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node.


I've changed the app to retain less data for that CF but I think that it was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows are write-once, never updated, and rarely read.  There are no secondary indexes for this particular CF.


 
They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like bloom_filter_fp_chance defaults to 0.0.
The default should be 0.000744.

If the chance is zero or null this code should run when a new SSTable is written:

    // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things
    logger.error("Bloom filter FP chance of zero isn't supposed to happen");

Were the CFs migrated from an old version?


Yes, the CFs were created in 1.0.9, then migrated to 1.0.11 and finally to 1.1.5 with an "upgradesstables" run at each upgrade along the way.

I could not find a way to view the current bloom_filter_fp_chance settings when they are at a default value.  JMX reports the actual fp rate and if a specific rate is set for a CF that shows up in "describe table" but I couldn't find out how to tell what the default was.  I didn't inspect the source.

 
Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, the number of rows, and the fp chance are known?

See o.a.c.utils.BloomFilter.getFilter() in the code.
The calculator at http://hur.st/bloomfilter appears to give similar results.




Ahh, very helpful.  This indicates that 714MB would be used for the bloom filter for that one CF.
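
For anyone wanting to reproduce that estimate, here is a minimal sketch using the standard textbook bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2 bits. This is an assumption about what the calculator does, not a copy of the Cassandra code, which may round or bucket differently. The function name is made up for illustration; the inputs are the approximate 400M rows per node and the 0.000744 default fp chance mentioned earlier in the thread.

```python
import math


def bloom_filter_bytes(num_keys: int, fp_chance: float) -> float:
    """Estimate bloom filter size in bytes with the textbook formula.

    Hypothetical helper: it is not part of Cassandra, and Cassandra's own
    sizing logic may give a somewhat different number.
    """
    if num_keys <= 0:
        raise ValueError("num_keys must be positive")
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    # Optimal number of bits for n keys at false-positive probability p.
    bits = -num_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8


# Figures taken from the thread: ~400M rows per node, fp chance 0.000744.
size = bloom_filter_bytes(400_000_000, 0.000744)
print(f"{size / 2**20:.0f} MiB")
```

With these inputs the estimate lands close to the 714MB figure quoted above. Note that row key size does not appear in the formula; only the key count and the fp chance matter. The same function also shows how quickly the size falls as the fp chance is raised.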

JMX / cfstats reports "Bloom Filter Space Used" but the MBean method name (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If the on-disk and in-memory sizes are similar then summing up all the "Bloom Filter Space Used" values says they're currently consuming 1-2 GB of the heap, which is substantial.
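
The summing step can be scripted rather than done by hand. This is a rough sketch that assumes the cfstats text contains one "Bloom Filter Space Used: <number>" line per CF with the value in bytes. The function name and the sample text are invented for illustration and are not real output from these nodes.

```python
import re

# Assumed line format in cfstats output: "Bloom Filter Space Used: <bytes>".
BLOOM_LINE = re.compile(r"Bloom Filter Space Used:\s*(\d+)")


def total_bloom_filter_bytes(cfstats_text: str) -> int:
    """Sum every 'Bloom Filter Space Used' value found in the text."""
    return sum(int(m.group(1)) for m in BLOOM_LINE.finditer(cfstats_text))


# Made-up sample purely to exercise the parser.
sample = """
    Column Family: cf_one
    Bloom Filter Space Used: 1000
    Column Family: cf_two
    Bloom Filter Space Used: 2500
"""

print(total_bloom_filter_bytes(sample))
```

In practice the saved output of "nodetool cfstats" would be passed in instead of the sample string. As noted above, the reported figure may be on-disk space, so the total is only a proxy for heap usage.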

If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0?  It just means more trips to SSTable indexes for a read, correct?  Trade RAM for time (disk I/O).

-Bryan