incubator-cassandra-user mailing list archives

From aaron morton <aaron@thelastpickle.com>
Subject Re: constant CMS GC using CPU time
Date Thu, 25 Oct 2012 08:34:37 GMT
> Regarding memory usage after a repair ... Are the merkle trees kept around?
> 

They should not be.

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 4:51 PM, B. Todd Burruss <btoddb@gmail.com> wrote:

> Regarding memory usage after a repair ... Are the merkle trees kept around?
> 
> On Oct 23, 2012 3:00 PM, "Bryan Talbot" <btalbot@aeriagames.com> wrote:
> On Mon, Oct 22, 2012 at 6:05 PM, aaron morton <aaron@thelastpickle.com> wrote:
>> The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused the constant GC continued.
> If you restart a node is the onset of GC activity correlated to some event?
> 
> Yes and no.  When the nodes were generally under the .75 occupancy threshold, a weekly "repair -pr" job would cause them to go over the threshold and then stay there even after the repair had completed and there were no ongoing compactions.  It acts as though at least some substantial amount of memory used during repair was never dereferenced once the repair was complete.
> 
> Once one CF in particular grew larger, the constant GC would start up pretty soon (less than 90 minutes) after a node restart, even without a repair.
> 
> 
>  
>  
>> As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped.  So it's definitely related to data load.  Memtable size is 1 GB, row cache is disabled and key cache is small (default).
> How many keys did the CF have per node? 
> I dismissed the memory used to hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node.
> 
> 
> I've changed the app to retain less data for that CF but I think that it was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows are write-once, never updated, and rarely read.  There are no secondary indexes for this particular CF.
> 
> 
>  
>> They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like the default bloom_filter_fp_chance is 0.0
> The default should be 0.000744.
> 
> If the chance is zero or null this code should run when a new SSTable is written:
> 
>     // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things
>     logger.error("Bloom filter FP chance of zero isn't supposed to happen");
> 
> Were the CFs migrated from an old version?
> 
> 
> Yes, the CFs were created in 1.0.9, then migrated to 1.0.11 and finally to 1.1.5 with an "upgradesstables" run at each upgrade along the way.
> 
> I could not find a way to view the current bloom_filter_fp_chance settings when they are at a default value.  JMX reports the actual fp rate, and if a specific rate is set for a CF it shows up in "describe table", but I couldn't find out how to tell what the default was.  I didn't inspect the source.
> 
>  
>> Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number of rows, and fp chance are known?
> 
> See o.a.c.utils.BloomFilter.getFilter() in the code 
> This http://hur.st/bloomfilter appears to give similar results. 
> 
> 
> 
> 
> Ahh, very helpful.  This indicates that 714MB would be used for the bloom filter for that one CF.
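> 
> For what it's worth, the textbook sizing formula reproduces that number.  A quick sketch, assuming the standard formula bits = -n * ln(p) / (ln 2)^2 (the exact figure from o.a.c.utils.BloomFilter.getFilter() may differ a little since it rounds to a whole number of buckets per key):
> 
>     public class BloomFilterSizeEstimate {
>         public static void main(String[] args) {
>             long keys = 400000000L;   // roughly 400M row keys per node, per the numbers above
>             double p = 0.000744;      // the default bloom_filter_fp_chance
>             // standard bloom filter sizing: bits = -n * ln(p) / (ln 2)^2
>             double bits = -keys * Math.log(p) / (Math.log(2) * Math.log(2));
>             System.out.printf("~%.0f MB%n", bits / 8 / (1024 * 1024));   // prints ~715 MB
>         }
>     }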
> 
> JMX / cfstats reports "Bloom Filter Space Used" but the MBean method name (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space.  If on-disk and in-memory space used is similar, then summing up all the "Bloom Filter Space Used" values says they're currently consuming 1-2 GB of the heap, which is substantial.
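> 
> In case it is useful, the per-CF numbers can be pulled straight from JMX and summed.  This is only a rough sketch: it assumes the default JMX port 7199 on localhost and that each per-CF MBean under org.apache.cassandra.db:type=ColumnFamilies exposes a BloomFilterDiskSpaceUsed attribute matching that getter name:
> 
>     import java.util.Set;
>     import javax.management.MBeanServerConnection;
>     import javax.management.ObjectName;
>     import javax.management.remote.JMXConnector;
>     import javax.management.remote.JMXConnectorFactory;
>     import javax.management.remote.JMXServiceURL;
> 
>     public class SumBloomFilterSpace {
>         public static void main(String[] args) throws Exception {
>             JMXServiceURL url = new JMXServiceURL(
>                     "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
>             JMXConnector jmxc = JMXConnectorFactory.connect(url);
>             try {
>                 MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
>                 // one MBean per column family; sum the attribute across all of them
>                 Set<ObjectName> cfs = mbs.queryNames(
>                         new ObjectName("org.apache.cassandra.db:type=ColumnFamilies,*"), null);
>                 long total = 0;
>                 for (ObjectName cf : cfs) {
>                     total += (Long) mbs.getAttribute(cf, "BloomFilterDiskSpaceUsed");
>                 }
>                 System.out.println("Total bloom filter space used: " + total + " bytes");
>             } finally {
>                 jmxc.close();
>             }
>         }
>     }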
> 
> If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0?  It just means more trips to SSTable indexes for a read, correct?  Trade RAM for time (disk I/O).
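> 
> If it is safe, I assume the change itself is just a CF attribute update, something like the following in cassandra-cli (keyspace and CF names here are placeholders, and existing SSTables presumably keep their current filters until they are rewritten by compaction or upgradesstables):
> 
>     use MyKeyspace;
>     update column family MyCF with bloom_filter_fp_chance = 1.0;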
> 
> -Bryan
> 

