Thank you very much for the help. Aaron was right: we had a multiget_count query which, depending on the app input, could trigger a calculation over ~40k keys.
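
For anyone hitting the same thing, here is a minimal sketch of one common way to bound the work per request: split the key list into fixed-size batches and merge the per-batch results. This is an illustration only, not necessarily the fix we shipped. The names `chunked` and `count_in_batches`, the `count_fn` callback (a stand-in for whatever client call issues the multiget_count), and the batch size of 100 are all assumptions.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(keys: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield successive batches containing at most batch_size keys."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(keys), batch_size):
        yield list(keys[start:start + batch_size])


def count_in_batches(
    keys: Sequence[str],
    count_fn: Callable[[List[str]], Dict[str, int]],
    batch_size: int = 100,
) -> Dict[str, int]:
    """Call count_fn once per batch and merge the per-key counts.

    count_fn is a hypothetical stand-in for the client's multiget_count
    call; it takes a list of keys and returns a {key: count} mapping.
    """
    totals: Dict[str, int] = {}
    for batch in chunked(keys, batch_size):
        totals.update(count_fn(batch))
    return totals
```

With this shape, no single request asks a coordinator to count more than `batch_size` rows at once, at the cost of more round trips.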

We've released the fix, and GCInspector warnings dropped from ~100 per day per node to ~1 per day per 30 nodes :)

We've seen OOMs in production when the OS was not properly prepared.






We have a cluster in which 1/6 of the nodes died for various reasons (3 had an OOM message).

Nodes died in groups of 3, 1, and 2. No adjacent nodes died, though we use SimpleSnitch.


Version:         1.1.6

Hardware:      12 GB RAM / 8 cores (virtual)

Data:              40 GB/node

Nodes:           36 nodes


Keyspaces:    2 (RF=3, R=W=2) + 1 (OpsCenter)

CFs:                36, 2 indexes

Partitioner:      Random

Compaction:   Leveled (we don't want 2x space for housekeeping)

Caching:          Keys only


Everything is pretty much standard, apart from one CF that receives writes in 64K chunks and has sstable_size_in_mb=100.

No JNA installed - this is to be fixed soon.


Checking sysstat/sar, I can see 80-90% CPU idle, no anomalies in I/O, and only one change: network activity spiking.

Before dying, all the nodes had the following in their logs:

> INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher               1         4         0

> INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter                       1         3         0

> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff                     1         6         0

> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager                 5         9


GCInspector warnings were there too; heap usage went from ~0.8 to 3 GB in 5-10 minutes.
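
To compare these StatusLogger lines across nodes, a small script can pull the per-pool counters out of the logs. The sketch below assumes the three numeric columns are active, pending, and blocked (the header line is not shown in the excerpt above), and that the CompactionManager line carries only the first two. The function name `parse_status_line` is hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the tail of a StatusLogger line: pool name, then two or three integers.
# Assumed column order: active, pending, blocked (blocked may be absent).
_STATUS_RE = re.compile(
    r"StatusLogger\.java \(line \d+\)\s+(\S+)\s+(\d+)\s+(\d+)(?:\s+(\d+))?\s*$"
)


def parse_status_line(line: str) -> Optional[Dict[str, object]]:
    """Return pool counters from one StatusLogger line, or None if it does not match."""
    match = _STATUS_RE.search(line)
    if match is None:
        return None
    name, active, pending, blocked = match.groups()
    return {
        "pool": name,
        "active": int(active),
        "pending": int(pending),
        # The CompactionManager line has no third column in the excerpt.
        "blocked": int(blocked) if blocked is not None else None,
    }
```

Feeding each log line through `parse_status_line` and keeping the entries with a non-zero `pending` value gives a quick view of which pools were backing up before a node died.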


So, could you please give me a hint on:

1. How many GCInspector warnings per hour are considered 'normal'?

2. What should be the next thing to check?

3. What are the possible failure reasons, and how can they be prevented?


