cassandra-user mailing list archives

From Peter Schuller
Subject Re: Nodes getting slowed down after a few days of smooth operation
Date Mon, 11 Oct 2010 22:38:14 GMT
> My motivation was that since I don't have too much data (10G each node) then
> why don't I cache the hell out of it, so I started with a cache size of 100%
> and a much larger heap size (started with 12G out of the 16G ram). Over time
> I've learned that too much heap for the JVM is like a kid in a candy shop,
> it'll eat as much as it can and then throw up (the kid was GC storming),

In general CMS will tend to gobble up the maximum heap size unless
your workload is such that the heuristics really work well and don't
expand the heap beyond some level, but it won't magically fill the
heap with data that doesn't exist. If you were reaching the maximum
heap size with 12 GB, making the heap 6 GB instead won't make it

Also, just be sure that you're really having an issue with GC. For
example, frequent young-generation GCs are fully expected and normal.
If you are seeing extremely frequent concurrent mark/sweep phases that
do not free up a lot of data, that is an indication that the heap is
too small.

So, with respect to "GC storming", a bigger heap is generally better.
The bigger the heap, the more effective GC is and the less often a
concurrent mark/sweep has to happen.

But this does not mean you want to give it too big a heap either,
since whatever is gobbled up by the heap *won't* be used by the
operating system for buffer caching.

Keeping a big row cache may or may not be a good idea depending on
circumstances, but if you have one, that directly implies additional
heap usage and the heap must be sized accordingly. The row cache is
just objects in memory; there is no automatic row cache size
adjustment in response to heap pressure.

If 10 million rows is your entire data set, and if that dataset is 10
GB on disk (without in-memory object overhead), then I am not
surprised at all that you're seeing issues after a few days of uptime.
Likely the row cache is just much too big for the heap.

> so
> I started lowering the max heap until I reached 6G. with 4G I ran OOM BTW.

Note that OOM and GC storming are often equivalent in terms of their
cause (unless the OOM is caused by a single huge allocation or
something). It's just that actually determining whether you are "out
of memory" is difficult for the JVM, so there are heuristics involved.
You may be sufficiently out of memory that you see excessive GC
activity, but not so much as to trigger the threshold of GC
inefficiency at which the JVM decides to actually throw an OOM.

> So now I have row cach capacity of effectively 100%, a heap size of 6G, data
> of 10G and so I wonder how come the heap doesn't explode?

Well, everything up to now has suggested to me that it *is* exploding ;)


> Well, as it turns out, although I have 10G data on each node, the row cache
> effective size is only about  681 * 2377203 = 1.6G (bytes)
>                 Key cache: disabled
>                 Row cache capacity: 10000000
>                 Row cache size: 2377203
>                 Row cache hit rate: 0.7017551635100059
>                 Compacted row minimum size: 392
>                 Compacted row maximum size: 102961
>                 Compacted row mean size: 681
> This strengthens what both Peter and Brandon have suggested that the row
> cache is generating too much GC b/c it gets invalidated too frequently.

Note that the compacted row size is not directly indicative of
in-memory row size. I'm not sure offhand what the overhead is expected
to be, but you can probably assume a factor of 2 just from general
fragmentation issues. Add to that the overhead from the representation
in object form itself, etc. 1.6 GB x 2 = 3.2 GB. Now we're starting to
get close to the 6 GB heap, especially taking into account additional
overhead and other things on the heap.
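
To make that arithmetic concrete, here is a rough sketch in Python.
The entry count and mean row size come from the cfstats output quoted
above; the factor of 2 is only the guess made above, not a measured
value, and the function name is made up for illustration.

```python
# Figures taken from the cfstats output quoted above.
ROW_CACHE_ENTRIES = 2_377_203     # "Row cache size"
MEAN_COMPACTED_ROW_BYTES = 681    # "Compacted row mean size"

# Assumption: factor of 2 for in-memory overhead, as guessed above.
# The real factor is unknown and may well be higher.
ASSUMED_OVERHEAD_FACTOR = 2.0


def estimate_row_cache_heap_bytes(entries, mean_row_bytes, overhead_factor):
    """Very rough estimate of the heap consumed by the row cache."""
    return entries * mean_row_bytes * overhead_factor


on_disk = ROW_CACHE_ENTRIES * MEAN_COMPACTED_ROW_BYTES
in_heap = estimate_row_cache_heap_bytes(
    ROW_CACHE_ENTRIES, MEAN_COMPACTED_ROW_BYTES, ASSUMED_OVERHEAD_FACTOR
)

print(f"compacted size of cached rows: {on_disk / 1e9:.2f} GB")
print(f"estimated heap usage:          {in_heap / 1e9:.2f} GB")
```

Treat the result as a lower bound at best; the point is only that the
compacted size understates what the cache costs on the heap.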

> That's certainly possible, so I'll try to set a 50% row cache size on one of
> the nodes (and wait about a week...) and see what happens, and if this
> proves to be the answer then this means that my dream of "I have so little
> data and so much ram, why don't I cache the hell out of it" isn't going to
> come true b/c too much of the row cache gets invalidated and hence GCed
> which creates too much overhead for the JVM. (well, at least I was getting
> nice read performance while it lasted ;)

Given that you're not hitting your maximum cache size, data isn't
evicted from the cache except as it is updated. Presumably that means
you're actually not hitting the worst-case scenario, which is LRU
eviction. Even then though, it's not as simple as it just being too
much for the JVM. Especially given the rows/second that you'd expect
to be evicted in Cassandra. A high rate of eviction does mean you need
more margin in terms of free heap, but I seriously doubt the
fundamental problem here is GC throughput vs. eviction rate.

In general, I cannot stress enough: use jconsole/VisualVM to observe
heap usage, or at least check the system logs for the results of GC
and keep track of the heap usage after concurrent mark/sweep
collections (not ParNew collections) to get a sense of what the actual
amount of heap space "needed" is.

> If this is true, then how would you recommend optimizing the row cache size
> for maximum utility and minimum GC overhead?

For one thing, I doubt a row cache as large as you have is very
useful. If it takes several days to fill up to the point of you seeing
memory problems, that suggests to me that it's far larger than
actually needed. Presumably you will never want to have a system which
is unable to function until after *days* of warm-up period.

Simply decreasing its size significantly would be my recommendation,
if you keep it at all. Significantly being, I dunno; 100? A rule of
thumb may be to watch how many rows are populated by X minutes or
hours of operation (some reasonable warm-up period). Then just take
that number and use it as the max (or less).

Remember that caching will be done by the OS anyway, though that does
not make the row cache useless (in particular the row cache survives
compactions, meaning that compactions may have less of an impact when
there is a row cache involved).

On minimum GC overhead: again, the key point is to monitor heap
usage. Unless you're doing cache eviction at 50k+ rows/second or
something along those lines, I don't think there should be any issue
unless the cache is too big for the heap.

> I've pasted here a log snippet from one of the servers while it was at high
> CPU and GCing

So you have it here:

INFO [GC inspection] 2010-10-11 02:05:22,857 (line
129) GC for ConcurrentMarkSweep: 27428 ms, 183140360 reclaimed leaving
6253188640 used; max is 6552551424

At the end of a collection cycle, which took 27 seconds, it freed
less than a couple of hundred megabytes, leaving an almost full heap.
This simply means flat-out that you have too much data (Java objects)
in the heap relative to the heap size. Heap size must be increased, or
memory use decreased. In your case almost certainly the latter.
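
Spelled out with the numbers from that log line, as a small Python
sketch. The variable names are mine, and "used before" is only an
approximation, since CMS runs concurrently with the application.

```python
# Numbers copied from the ConcurrentMarkSweep log line above.
cycle_ms = 27428                # duration of the collection cycle
reclaimed = 183_140_360         # bytes freed by the cycle
used_after = 6_253_188_640      # bytes still in use afterwards
heap_max = 6_552_551_424        # maximum heap size in bytes

# Approximation: ignores allocation that happened during the cycle.
used_before = used_after + reclaimed

occupancy_after = used_after / heap_max
freed_fraction = reclaimed / used_before
free_after_mib = (heap_max - used_after) / 2**20

print(f"cycle took {cycle_ms / 1000:.1f} s")
print(f"heap occupancy after the cycle: {occupancy_after:.1%}")
print(f"fraction of used heap freed:    {freed_fraction:.1%}")
print(f"free heap after the cycle:      {free_after_mib:.0f} MiB")
```

A cycle that takes this long and frees only a few percent of the heap
is the signature of a live set that barely fits.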

> GC runs every like 20-40 seconds and almost for the entire duration of that
> 20-40 secs. I'm not sure what to make of all the other numbers such as: GC
> for ConcurrentMarkSweep: 22742 ms, 181335192 reclaimed leaving 6254994856
> used; max is 6552551424

See above; these are the critical lines, and they probably tell you
more about the memory issues than any other line in the log file.
They tell you what the actual "live set" is ("XXX used; max is YYY"
means that XXX out of YYY bytes in the heap are live).

(I am making *some* simplifications here because the concurrent nature
of CMS means that you never get a snapshot view; but for practical
purposes you can consider the above to be true with Cassandra.)

Similar lines for "ParNew" you can mostly ignore for the purpose of
monitoring heap usage unless you specifically know what you're looking
for in those. The ConcurrentMarkSweep ones are what tell you what the
actual amount of live data in the heap is.
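
If you would rather pull these numbers out of the log than eyeball
them, something along these lines might do. It is a sketch in Python
and assumes that each GC entry sits on a single line in the actual log
file (the mail wraps them) and has the shape of the lines quoted
above. The ParNew entry in the sample is invented purely to show that
it gets skipped; its numbers are not from any real log.

```python
import re

# Matches the GC summary text as it appears in the lines quoted above.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms, "
    r"(?P<reclaimed>\d+) reclaimed leaving (?P<used>\d+) used; "
    r"max is (?P<max>\d+)"
)


def cms_live_sets(lines):
    """Yield (used_bytes, max_bytes) for ConcurrentMarkSweep lines only."""
    for line in lines:
        match = GC_LINE.search(line)
        if match and match.group("collector") == "ConcurrentMarkSweep":
            yield int(match.group("used")), int(match.group("max"))


SAMPLE = [
    "GC for ConcurrentMarkSweep: 27428 ms, 183140360 reclaimed "
    "leaving 6253188640 used; max is 6552551424",
    # Invented ParNew entry, for illustration only.
    "GC for ParNew: 25 ms, 50000000 reclaimed "
    "leaving 6300000000 used; max is 6552551424",
    "GC for ConcurrentMarkSweep: 22742 ms, 181335192 reclaimed "
    "leaving 6254994856 used; max is 6552551424",
]

results = list(cms_live_sets(SAMPLE))
for used, maximum in results:
    print(f"live set is roughly {used / maximum:.1%} of max heap")
```

Feeding it the lines of the real log file instead of SAMPLE should
give a rough picture of the live set over time.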

/ Peter Schuller
