cassandra-user mailing list archives

From Jason Wee <peich...@gmail.com>
Subject Re: Long GC due to promotion failures
Date Wed, 22 Jan 2014 11:02:04 GMT
 SSTable count: 365

Your SSTable count is too high... I don't know what the ideal count
should be, but in my experience anything below 20 is good. Is your
compaction running?
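
If you want to check, something like this should do it (a rough sketch;
adjust the host to yours):

  # show compactions currently in progress and the pending count
  nodetool -h localhost compactionstats

  # per-column-family stats; "SSTable count" is on the line after the CF name
  nodetool -h localhost cfstats | grep -A 1 "Column Family:"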

I have read a few blogs on how to read cfhistograms, but never really
understood them fully. Would anyone care to explain using the OP's
attached cfhistograms?
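
For reference, I run it like this (replace the keyspace and column
family with your own):

  # per-CF histograms: SSTables touched per read, latencies, row sizes
  nodetool -h localhost cfhistograms <keyspace> <column_family>

My (possibly wrong) reading: Offset is the bucket value, and the
SSTables column counts recent reads that touched that many sstables.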

Taking a wild shot: perhaps try a different build, Oracle JDK 1.6u25
maybe?
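
Also, before swapping JDKs, a couple of GC flags might narrow it down
(a sketch, assuming the usual cassandra-env.sh JVM_OPTS setup):

  # log the size of each failed promotion (your log suggests this is on already)
  JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
  # dump CMS free list statistics to spot old-gen fragmentation
  JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"

Promotion failures under CMS are usually old-gen fragmentation rather
than a genuinely full heap, so a shrinking max chunk size in the FLS
stats would confirm it.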

HTH

Jason




On Tue, Jan 21, 2014 at 4:02 PM, John Watson <john@disqus.com> wrote:

> Pretty reliably, at some point, nodes will have super long GCs.
> Followed by https://issues.apache.org/jira/browse/CASSANDRA-6592
>
> Lovely log messages:
>
>   9030.798: [ParNew (0: promotion failure size = 4194306)  (2:
> promotion failure size = 4194306)  (4: promotion failure size =
> 4194306)  (promotion failed)
>   Total time for which application threads were stopped: 23.2659990 seconds
>
> Full gc.log until just before restarting the node (see another 32s GC
> near the end): https://gist.github.com/dctrwatson/f04896c215fa2418b1d9
>
> Here's a graph of GC time, where we can see an increase 30 minutes
> prior (an indicator that the issue will happen soon):
> http://dl.dropboxusercontent.com/s/q4dr7dle023w9ih/render.png
>
> Graph of various heap usage metrics:
> http://dl.dropboxusercontent.com/s/e8kd8go25ihbmkl/download.png
>
> Running compactions in the same time frame:
> http://dl.dropboxusercontent.com/s/li9tggk4r2l3u4b/render%20(1).png
>
> CPU, IO, ops and latencies:
>
> https://dl.dropboxusercontent.com/s/yh9osm9urplikb7/2014-01-20%20at%2011.46%20PM%202x.png
>
> cfhistograms/cfstats:
> https://gist.github.com/dctrwatson/9a08b38d0258ae434b15
>
> Cassandra 1.2.13
> Oracle JDK 1.6u45
>
> JVM opts:
>
> MAX_HEAP_SIZE="8G"
> HEAP_NEW_SIZE="1536M"
>
> Tried HEAP_NEW_SIZE of 768M, 800M, 1000M and 1600M
> Tried default "-XX:SurvivorRatio=8" and "-XX:SurvivorRatio=4"
> Tried default "-XX:MaxTenuringThreshold=1" and "-XX:MaxTenuringThreshold=2"
>
> All still eventually ran into long GCs.
>
> Hardware for all 3 nodes:
>
> (2) E5520 @ 2.27GHz (8 cores w/ HT) ["16" cores]
> (6) 4GB RAM [24G RAM]
> (1) 500GB 7.2k for commitlog
> (2) 400G SSD for data (configured as separate data directories)
>
