cassandra-user mailing list archives

From Ran Tavory
Subject Re: Nodes getting slowed down after a few days of smooth operation
Date Mon, 11 Oct 2010 22:51:49 GMT
Peter, you're my JVM GC hero!
Thank you!

On Tue, Oct 12, 2010 at 12:38 AM, Peter Schuller wrote:

> > My motivation was that since I don't have too much data (10G each node) then
> > why don't I cache the hell out of it, so I started with a cache size of 100%
> > and a much larger heap size (started with 12G out of the 16G ram). Over time
> > I've learned that too much heap for the JVM is like a kid in a candy shop,
> > it'll eat as much as it can and then throw up (the kid was GC storming),
> In general CMS will tend to gobble up the maximum heap size unless
> your workload is such that the heuristics really work well and don't
> expand the heap beyond some level, but it won't magically fill the
> heap with data that doesn't exist. If you were reaching the maximum
> heap size with 12 GB, making the heap 6 GB instead won't make it
> better.
> Also, just be sure that you're really having an issue with GC. For
> example frequent young-generation GC:s are fully expected and normal.
> If you are seeing extremely frequent concurrent mark/sweep phases that
> do not free up a lot of data - that is an indication that the heap is
> too small.
> So, with respect to "GC storming", a bigger heap is generally better.
> The bigger the heap, the more effective GC is and the less often a
> concurrent mark/sweep has to happen.
> But this does not mean you want to give it too big a heap either,
> since whatever is gobbled up by the heap *won't* be used by the
> operating system for buffer caching.
> Keeping a big row cache may or may not be a good idea depending on
> circumstances, but if you have one, that directly implies additional
> heap usage and the heap must be sized accordingly. The row cache is
> just objects in memory; there is no automatic row cache size
> adjustment in response to heap pressure.
> If 10 million rows is your entire data set, and if that dataset is 10
> GB on disk (without in-memory object overhead), then I am not
> surprised at all that you're seeing issues after a few days of uptime.
> Likely the row cache is just much too big for the heap.
> > so I started lowering the max heap until I reached 6G. with 4G I ran OOM BTW.
> Note that OOM and GC storming are often equivalent in terms of their
> cause (unless the OOM is caused by a single huge allocation or
> something). It's just that actually determining whether you are "out
> of memory" is difficult for the JVM, so there are heuristics involved.
> You may be sufficiently out of memory that you see excessive GC
> activity, but not so much as to trigger the threshold of GC
> inefficiency at which the JVM decides to actually throw an OOM.
> > So now I have row cache capacity of effectively 100%, a heap size of 6G, data
> > of 10G and so I wonder how come the heap doesn't explode?
> Well, everything up to now has suggested to me that it *is* exploding ;)
> But:
> > Well, as it turns out, although I have 10G data on each node, the row cache
> > effective size is only about 681 * 2377203 = 1.6G (bytes)
> >                 Key cache: disabled
> >                 Row cache capacity: 10000000
> >                 Row cache size: 2377203
> >                 Row cache hit rate: 0.7017551635100059
> >                 Compacted row minimum size: 392
> >                 Compacted row maximum size: 102961
> >                 Compacted row mean size: 681
> > This strengthens what both Peter and Brandon have suggested that the row
> > cache is generating too much GC b/c it gets invalidated too frequently.
> Note that the compacted row size is not directly indicative of
> in-memory row size. I'm not sure offhand what the overhead is expected
> to be, but you can probably assume a factor of 2 just from general
> fragmentation issues. Add to that overhead from the
> representation in object form itself etc. 1.6x2 = 3.2. Now we're
> starting to get close, especially taking into account additional
> overhead and other things on the heap.
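A minimal sketch of the estimate above, in Python. The row figures come from the cfstats output quoted earlier and the heap maximum from the GC log lines further down; the factor of 2 is only the rough fragmentation guess from the paragraph above, not a measured value, and the variable names are made up for illustration.

```python
# Figures quoted in this thread.
compacted_row_mean_size = 681      # bytes ("Compacted row mean size")
row_cache_size = 2_377_203         # rows  ("Row cache size")
max_heap_bytes = 6_552_551_424     # bytes ("max is ..." in the GC log lines)

# Compacted (on-disk) size of the rows currently held in the row cache.
on_disk_bytes = compacted_row_mean_size * row_cache_size

# ASSUMPTION: a factor of 2 for fragmentation, as guessed above. The real
# in-memory overhead (object headers, references, etc.) is not known here.
ASSUMED_OVERHEAD_FACTOR = 2
estimated_heap_bytes = on_disk_bytes * ASSUMED_OVERHEAD_FACTOR

heap_fraction = estimated_heap_bytes / max_heap_bytes

print(f"compacted size of cached rows: {on_disk_bytes / 1e9:.2f} GB")
print(f"estimated heap usage:          {estimated_heap_bytes / 1e9:.2f} GB")
print(f"fraction of max heap:          {heap_fraction:.0%}")
```

Even with only that guessed factor, the row cache alone would take roughly half of the heap, before counting memtables and everything else living there.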
> > That's certainly possible, so I'll try to set a 50% row cache size on one of
> > the nodes (and wait about a week...) and see what happens, and if this
> > proves to be the answer then this means that my dream of "I have so little
> > data and so much ram, why don't I cache the hell out of it" isn't going to
> > come true b/c too much of the row cache gets invalidated and hence GCed
> > which creates too much overhead for the JVM. (well, at least I was getting
> > nice read performance while it lasted ;)
> Given that you're not hitting your maximum cache size, data isn't
> evicted from the cache except as it is updated. Presumably that means
> you're actually not hitting the worst-case scenario, which is LRU
> eviction. Even then though, it's not as simple as it just being too
> much for the JVM. Especially given the rows/second that you'd expect
> to be evicted in Cassandra. A high rate of eviction does mean you need
> more margin in terms of free heap, but I seriously doubt the
> fundamental problem here is GC throughput vs. eviction rate.
> In general, I cannot stress enough - use jconsole/VisualVM to observe
> heap usage, or at least check system logs for the results of GC and
> keep track of the heap usage after concurrent mark/sweep collections
> (not ParNew:s) to get a sense for what the actual amount of heap space
> "needed" is.
> > If this is true, then how would you recommend optimizing the row cache size
> > for maximum utility and minimum GC overhead?
> For one thing, I doubt a row cache as large as you have is very
> useful. If it takes several days to fill up to the point of you seeing
> memory problems, that suggests to me that it's far larger than
> actually needed. Presumably you will never want to have a system which
> is unable to function until after *days* of warm-up period.
> Simply decreasing its size significantly would be my recommendation,
> if you keep it at all. Significantly being, I dunno; 100? A rule of
> thumb may be to watch how many rows are populated by X minutes or
> hours of operation (some reasonable warm-up period). Then just take
> that number and use it as the max (or less).
> Remember that caching will be done by the OS anyway, though that does
> not make the row cache useless (in particular the row cache survives
> compactions, meaning that compactions may have less of an impact when
> there is a row cache involved).
> On minimum GC overhead: again, the key point
> is to monitor heap usage. Unless you're doing cache eviction at 50k+
> rows/second or something along those lines, I don't think there should
> be any issue unless the cache is too large for the heap.
> > I've pasted here a log snippet from one of the servers while it was at high
> > CPU and GCing
> So you have it here:
> INFO [GC inspection] 2010-10-11 02:05:22,857 (line
> 129) GC for ConcurrentMarkSweep: 27428 ms, 183140360 reclaimed leaving
> 6253188640 used; max is 6552551424
> At the end of a collection cycle, which took 27 seconds, it freed
> less than 200 MB, leaving an almost full heap. This
> simply means flat-out that you have too much data (Java objects) in
> the heap relative to the heap size. Heap size must be increased, or
> memory use decreased. In your case almost certainly the latter.
> > GC runs every like 20-40 seconds and almost for the entire duration of that
> > 20-40 secs. I'm not sure what to make of all the other numbers such as: GC
> > for ConcurrentMarkSweep: 22742 ms, 181335192 reclaimed leaving 6254994856
> > used; max is 6552551424
> See above; these are the critical lines. They probably tell you more
> about the memory issues than any other line in the log file.
> They tell you what the actual "live set" is ("XXX used; max is YYY"
> means that XXX out of YYY bytes in the heap are live).
> (I am making *some* simplifications here because the concurrent nature
> of CMS means that you never get a snapshot view; but for practical
> purposes you can consider the above to be true with Cassandra.)
> Similar lines for "ParNew" you can mostly ignore for the purpose of
> monitoring heap usage unless you specifically know what you're looking
> for in those. The ConcurrentMarkSweep ones are what tell you what the
> actual amount of live data in the heap is.
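A minimal sketch, in Python, of pulling the live-set numbers out of GC lines like the ones quoted above. The regular expression is inferred only from the two lines pasted in this thread, so treat the exact log format as an assumption; `parse_gc_line` and `cms_live_fractions` are hypothetical helper names, not part of Cassandra.

```python
import re

# ASSUMPTION: pattern inferred from the two log lines quoted in this thread.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms, "
    r"(?P<reclaimed>\d+) reclaimed leaving (?P<used>\d+) used; "
    r"max is (?P<max>\d+)"
)


def parse_gc_line(line):
    """Return a dict of GC stats for one log line, or None if it does not match."""
    m = GC_LINE.search(line)
    if m is None:
        return None
    stats = {key: int(m.group(key)) for key in ("ms", "reclaimed", "used", "max")}
    stats["collector"] = m.group("collector")
    # Heap still in use after the collection, as a fraction of the maximum heap.
    stats["live_fraction"] = stats["used"] / stats["max"]
    return stats


def cms_live_fractions(lines):
    """Yield the live fraction for ConcurrentMarkSweep lines only (ParNew is skipped)."""
    for line in lines:
        stats = parse_gc_line(line)
        if stats is not None and stats["collector"] == "ConcurrentMarkSweep":
            yield stats["live_fraction"]


# The two lines from this thread, joined back onto single lines.
sample = [
    "INFO [GC inspection] 2010-10-11 02:05:22,857 (line 129) GC for "
    "ConcurrentMarkSweep: 27428 ms, 183140360 reclaimed leaving "
    "6253188640 used; max is 6552551424",
    "GC for ConcurrentMarkSweep: 22742 ms, 181335192 reclaimed leaving "
    "6254994856 used; max is 6552551424",
]

for fraction in cms_live_fractions(sample):
    print(f"live set after CMS: {fraction:.1%} of max heap")
```

If that fraction stays close to 100% from one ConcurrentMarkSweep line to the next, the heap is too small for what is being kept in it.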
> --
> / Peter Schuller

