We changed MaxTenuringThreshold from 1 to 5. Maybe that is causing the problem. We will change it back to 1 and see how the system behaves.

concurrent_compactors is currently 8. We will reduce this, since our system cannot handle that many simultaneous compactions anyway. We think it will also ease GC pressure to some extent.

Ideally, we would like ParNew itself to collect as much garbage as possible during compactions. What steps should we take to achieve this?

On Wed, Jul 4, 2012 at 4:07 PM, aaron morton <aaron@thelastpickle.com> wrote:
It *may* have been compaction from the repair, but it's not a big CF.

I would look at the logs to see how much data was transferred to the node. Was there a compaction going on while the GC storm was happening? Do you have a lot of secondary indexes?

If you think it correlates with compaction, you can try reducing concurrent_compactors.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote:

Recently, we faced a severe freeze [around 30-40 mins] on one of our servers. Many mutations and reads were dropped. The issue happened just after a routine nodetool repair of the CF below completed [1.0.7, NTS, DC1:3,DC2:2]

Column Family: MsgIrtConv
SSTable count: 12
Space used (live): 17426379140
Space used (total): 17426379140
Number of Keys (estimate): 122624
Memtable Columns Count: 31180
Memtable Data Size: 81950175
Memtable Switch Count: 31
Read Count: 8074156
Read Latency: 15.743 ms.
Write Count: 2172404
Write Latency: 0.037 ms.
Pending Tasks: 0
Bloom Filter False Postives: 1258
Bloom Filter False Ratio: 0.03598
Bloom Filter Space Used: 498672
Key cache capacity: 200000
Key cache size: 200000
Key cache hit rate: 0.9965579513062582
Row cache: disabled
Compacted row minimum size: 51
Compacted row maximum size: 89970660
Compacted row mean size: 226626


Our heap config is as follows:

-Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
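
For reference, here is a rough sketch of how the -Xmn800M and -XX:SurvivorRatio=8 flags above divide the young generation. It assumes the usual HotSpot meaning of SurvivorRatio (eden size divided by the size of one survivor space, with two survivor spaces) and ignores the alignment the JVM applies, so the real sizes may differ slightly. The function name is made up for illustration.

```python
def young_gen_layout(xmn_mb, survivor_ratio):
    """Estimate eden and per-survivor sizes in MB.

    Assumption: young gen = eden + 2 survivor spaces, and
    SurvivorRatio = eden / one survivor space. JVM alignment is ignored.
    """
    survivor_mb = xmn_mb / (survivor_ratio + 2)
    eden_mb = xmn_mb - 2 * survivor_mb
    return eden_mb, survivor_mb


# Values taken from the heap config above: -Xmn800M, -XX:SurvivorRatio=8
eden_mb, survivor_mb = young_gen_layout(800, 8)
print(f"eden ~ {eden_mb:.0f} MB, each survivor ~ {survivor_mb:.0f} MB")
```

Under these assumptions eden is about 640 MB and each survivor space about 80 MB. Objects that do not fit in a survivor space during a ParNew collection are promoted to the old generation regardless of MaxTenuringThreshold.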

From the yaml:
in_memory_compaction_limit=64
compaction_throughput_mb_sec=8
multi_threaded_compaction=false

 INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 762) [repair #2b6fcbf0-c1f9-11e1-0000-2ea8811bfbff] MsgIrtConv is fully synced
 INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 698) [repair #2b6fcbf0-c1f9-11e1-0000-2ea8811bfbff] session completed successfully
 INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java (line 221) Compacted to [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,].  47,907,012 to 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s.  Time: 6,186ms.

After this, the logs were completely filled with GC messages [ParNew/CMS]. ParNew ran about every 3 seconds and CMS about every 30 seconds, continuously for roughly 40 minutes.

 INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line 122) GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line 122) GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is 8506048512

.........................................

 INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line 122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line 122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:07:57,773 GCInspector.java (line 122) GC for ParNew: 1444 ms for 3 collections, 4234372296 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:07:59,387 GCInspector.java (line 122) GC for ParNew: 1153 ms for 2 collections, 4910279080 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:08:00,389 GCInspector.java (line 122) GC for ParNew: 697 ms for 2 collections, 4873857072 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:08:01,443 GCInspector.java (line 122) GC for ParNew: 726 ms for 2 collections, 4941511184 used; max is 8506048512
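
To put a number on how much time these entries represent, the following is a minimal sketch that parses the last six GCInspector lines above and estimates the fraction of wall-clock time spent in ParNew. It assumes each entry reports the pause time accumulated since the previous entry, which is why the first entry's pause is excluded from the total. The regular expression is written against the line format shown here and is not an official parser.

```python
import re
from datetime import datetime

# The last six GCInspector lines from the log above, copied verbatim.
LOG = """\
 INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line 122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line 122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:07:57,773 GCInspector.java (line 122) GC for ParNew: 1444 ms for 3 collections, 4234372296 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:07:59,387 GCInspector.java (line 122) GC for ParNew: 1153 ms for 2 collections, 4910279080 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:08:00,389 GCInspector.java (line 122) GC for ParNew: 697 ms for 2 collections, 4873857072 used; max is 8506048512
 INFO [ScheduledTasks:1] 2012-06-29 10:08:01,443 GCInspector.java (line 122) GC for ParNew: 726 ms for 2 collections, 4941511184 used; max is 8506048512
"""

PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"GCInspector\.java \(line \d+\) "
    r"GC for (?P<gc>\w+): (?P<ms>\d+) ms for (?P<n>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)


def parse(log):
    """Return one dict per GCInspector line found in the log text."""
    entries = []
    for line in log.splitlines():
        m = PATTERN.search(line)
        if m:
            entries.append({
                "ts": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f"),
                "gc": m["gc"],
                "ms": int(m["ms"]),
                "collections": int(m["n"]),
                "used": int(m["used"]),
                "max": int(m["max"]),
            })
    return entries


entries = parse(LOG)

# Wall-clock time between the first and last entry, in seconds.
window_s = (entries[-1]["ts"] - entries[0]["ts"]).total_seconds()

# Pause time reported inside that window (first entry excluded, see assumption above).
pause_ms = sum(e["ms"] for e in entries[1:])

gc_fraction = pause_ms / (window_s * 1000)
peak_heap = max(e["used"] / e["max"] for e in entries)

print(f"window: {window_s:.3f} s, ParNew pause: {pause_ms} ms")
print(f"fraction of time in GC: {gc_fraction:.1%}, peak heap use: {peak_heap:.1%}")
```

On this excerpt the estimate comes to roughly two thirds of the window spent in ParNew pauses, with heap usage peaking a little under 60% of the maximum.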

After this, the node stabilized and was back up and running. Any pointers would be greatly appreciated.