incubator-cassandra-user mailing list archives

From Peter Schuller <peter.schul...@infidyne.com>
Subject Re: Cassandra stress test and max vs. average read/write latency.
Date Fri, 23 Dec 2011 07:47:55 GMT
> Thanks for your input.  Can you tell me more about what we should be
> looking for in the gc log?   We've already got the gc logging turned
> on, and we've already done the plotting to show that in most
> cases the outliers are happening periodically (with a period of
> 10s of seconds to a few minutes, depending on load and tuning)

Are you measuring writes or reads? If writes,
https://issues.apache.org/jira/browse/CASSANDRA-1991 is still relevant
I think (sorry no progress from my end on that one). Also, I/O
scheduling issues can easily cause problems with the commit log
latency (on fsync()). Try switching to periodic commit log mode and
see if it helps, just to eliminate that (if you're not already in
periodic; if so, try upping the interval).
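
For reference, a minimal sketch of what periodic mode looks like in
cassandra.yaml. The two setting names are the standard commit log
options; the interval shown is only an illustrative value, not a
recommendation:

```yaml
# Acknowledge writes without waiting for the commit log fsync();
# the log is synced in the background every period instead.
commitlog_sync: periodic
# Illustrative interval in milliseconds; raise it to test whether
# fsync() latency is behind the write outliers.
commitlog_sync_period_in_ms: 10000
```

If the write outliers shrink or disappear with this change, commit log
fsync() latency is a likely contributor.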

For reads, I am generally unaware of much aside from GC and legitimate
"jitter" (scheduling/disk I/O etc) that would generate outliers. At
least that I can think of off hand...

And w.r.t. the GC log - yeah, correlating in time is one thing.
Another thing is to confirm what kind of GC pauses you're seeing.
Generally you want to be seeing lots of ParNew collections of short
duration, and those are tweakable by changing the young generation size. The
other thing is to make sure CMS is not failing (promotion
failure/concurrent mode failure) and falling back to a stop-the-world
serial compacting GC of the entire heap.
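
A rough sketch of how one might scan a GC log for those CMS failures.
It assumes the log contains the marker strings "promotion failed" and
"concurrent mode failure" that HotSpot prints in those cases; the
function names are made up for illustration:

```python
# Substrings HotSpot writes to the GC log when CMS cannot keep up and
# falls back to a stop-the-world collection of the entire heap.
FAILURE_MARKERS = ("promotion failed", "concurrent mode failure")


def find_cms_failures(lines):
    """Return (line_number, marker, text) for every line with a failure marker."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for marker in FAILURE_MARKERS:
            if marker in line:
                hits.append((lineno, marker, line.rstrip("\n")))
    return hits


def count_parnew(lines):
    """Count lines that mention a ParNew (young generation) collection."""
    return sum(1 for line in lines if "ParNew" in line)
```

Feeding it the lines of the GC log (for example via open(path)) gives a
quick yes/no on whether any fallback happened at all, plus a count of
young generation collections to compare against.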

You might also use -XX:+PrintGCApplicationStoppedTime to get a more
obvious and greppable report for each pause, regardless of
"type"/cause.
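
As a sketch of the "greppable" part: assuming the JVM runs with
HotSpot's -XX:+PrintGCApplicationStoppedTime and writes lines containing
"Total time for which application threads were stopped: N seconds", the
pauses can be pulled out like this. The function names and the 0.5 s
default threshold are illustrative choices:

```python
import re

# Matches the per-pause report line; any timestamp prefix or trailing
# text on the same line is ignored.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(?P<secs>\d+(?:[.,]\d+)?) seconds"
)


def stopped_pauses(lines):
    """Yield (line_number, pause_seconds) for each pause report line."""
    for lineno, line in enumerate(lines, start=1):
        match = STOPPED_RE.search(line)
        if match:
            # Some locales print a decimal comma; normalise it first.
            yield lineno, float(match.group("secs").replace(",", "."))


def long_pauses(lines, threshold_secs=0.5):
    """Return only the pauses at or above threshold_secs."""
    return [(n, s) for n, s in stopped_pauses(lines) if s >= threshold_secs]
```

Running long_pauses() over the log shows directly whether any single
pause is long enough to explain a 500+ ms outlier.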

> I've tried to correlate the times of the outliers with messages either
> in the system log or the gc log.   There seems to be some (but not
> complete) correlation between the outliers and system log messages about
> memtable flushing.   I cannot find anything in the gc log that
> seems to be an obvious problem, or that matches up with the
> times of the outliers.

And these are still the very extreme (500+ ms and such) outliers that
you're seeing w/o GC correlation? Off the top of my head, that seems
very unexpected (assuming a non-saturated system) and would definitely
invite investigation IMO.
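
To make the "w/o GC correlation" check less of an eyeball exercise,
here is a minimal sketch. It assumes you already have outlier
timestamps and pause (start, duration) pairs on the same clock, in
seconds; the names and the slack value are hypothetical:

```python
def overlaps_pause(t, pauses, slack):
    """True if time t falls inside any pause window, widened by slack seconds."""
    return any(
        start - slack <= t <= start + duration + slack
        for start, duration in pauses
    )


def split_outliers(outlier_times, pauses, slack=0.1):
    """Split outlier timestamps into (explained_by_pause, unexplained)."""
    explained, unexplained = [], []
    for t in outlier_times:
        if overlaps_pause(t, pauses, slack):
            explained.append(t)
        else:
            unexplained.append(t)
    return explained, unexplained
```

Whatever ends up in the unexplained list is what is worth chasing
elsewhere (memtable flushes, disk I/O, scheduling).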

If you're willing to start iterating with the source code, I'd start
bisecting down the call stack to see where it's happening.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
