cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: GC freeze just after repair session
Date Wed, 04 Jul 2012 10:37:27 GMT
It *may* have been compaction from the repair, but it's not a big CF.

I would look at the logs to see how much data was transferred to the node. Was there a compaction
going on while the GC storm was happening? Do you have a lot of secondary indexes?
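As a quick sketch of what I mean by checking the logs: grep system.log for repair, streaming, and compaction activity around the time of the freeze. The usual log location is /var/log/cassandra/system.log (your path may differ); to keep the example self-contained I run it against a sample file built from the lines you posted.

```shell
# Build a sample from the log lines in this thread (in practice, point
# grep at your real system.log instead of this sample file).
cat > /tmp/system.log.sample <<'EOF'
 INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 762) [repair #2b6fcbf0-c1f9-11e1-0000-2ea8811bfbff] MsgIrtConv is fully synced
 INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 698) [repair #2b6fcbf0-c1f9-11e1-0000-2ea8811bfbff] session completed successfully
 INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java (line 221) Compacted to [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,].  47,907,012 to 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s.  Time: 6,186ms.
EOF

# Pull out repair / streaming / compaction events to correlate with the GC storm.
grep -E 'AntiEntropy|Stream|Compact' /tmp/system.log.sample
```

While the node is live, `nodetool compactionstats` and `nodetool netstats` will show in-flight compactions and streams directly.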

If you think it is correlated with compaction, you can try reducing concurrent_compactors.
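For instance, in cassandra.yaml (the values below are illustrative, not a recommendation; concurrent_compactors commonly defaults to one per CPU core when unset):

```yaml
# cassandra.yaml — throttle compaction so it competes less with
# reads/writes during and after repair. Illustrative values only.
concurrent_compactors: 1
compaction_throughput_mb_per_sec: 8
```

Note the yaml key is compaction_throughput_mb_per_sec; lowering concurrent_compactors trades slower compaction for less memory and CPU pressure.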

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote:

> Recently, we faced a severe freeze [around 30-40 mins] on one of our servers. There were many mutations/reads dropped. The issue happened just after a routine nodetool repair for the below CF completed [1.0.7, NTS, DC1:3,DC2:2]
> 
> 		Column Family: MsgIrtConv
> 		SSTable count: 12
> 		Space used (live): 17426379140
> 		Space used (total): 17426379140
> 		Number of Keys (estimate): 122624
> 		Memtable Columns Count: 31180
> 		Memtable Data Size: 81950175
> 		Memtable Switch Count: 31
> 		Read Count: 8074156
> 		Read Latency: 15.743 ms.
> 		Write Count: 2172404
> 		Write Latency: 0.037 ms.
> 		Pending Tasks: 0
> 		Bloom Filter False Positives: 1258
> 		Bloom Filter False Ratio: 0.03598
> 		Bloom Filter Space Used: 498672
> 		Key cache capacity: 200000
> 		Key cache size: 200000
> 		Key cache hit rate: 0.9965579513062582
> 		Row cache: disabled
> 		Compacted row minimum size: 51
> 		Compacted row maximum size: 89970660
> 		Compacted row mean size: 226626
> 
> 
> Our heap config is as follows
> 
> -Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> 
> from yaml
> in_memory_compaction_limit=64
> compaction_throughput_mb_sec=8
> multi_threaded_compaction=false
> 
>  INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 762) [repair #2b6fcbf0-c1f9-11e1-0000-2ea8811bfbff] MsgIrtConv is fully synced
>  INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 698) [repair #2b6fcbf0-c1f9-11e1-0000-2ea8811bfbff] session completed successfully
>  INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java (line 221) Compacted to [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,].  47,907,012 to 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s.  Time: 6,186ms.
> 
> After this, the logs were completely filled with GC [ParNew/CMS]. ParNew ran every 3 seconds, while CMS ran roughly every 30 seconds, continuously for 40 minutes.
> 
>  INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line 122) GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is 8506048512
>  INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line 122) GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is 8506048512
> 
> .........................................
> 
>  INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line 122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is 8506048512
>  INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line 122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is 8506048512
>  INFO [ScheduledTasks:1] 2012-06-29 10:07:57,773 GCInspector.java (line 122) GC for ParNew: 1444 ms for 3 collections, 4234372296 used; max is 8506048512
>  INFO [ScheduledTasks:1] 2012-06-29 10:07:59,387 GCInspector.java (line 122) GC for ParNew: 1153 ms for 2 collections, 4910279080 used; max is 8506048512
>  INFO [ScheduledTasks:1] 2012-06-29 10:08:00,389 GCInspector.java (line 122) GC for ParNew: 697 ms for 2 collections, 4873857072 used; max is 8506048512
>  INFO [ScheduledTasks:1] 2012-06-29 10:08:01,443 GCInspector.java (line 122) GC for ParNew: 726 ms for 2 collections, 4941511184 used; max is 8506048512
> 
> After this, the node stabilized and was back up and running. Any pointers would be greatly appreciated.

