Hi all,

thank you very much for the help. Aaron was right: we had a multiget_count query that, depending on the app input, could end up performing the count across ~40k keys.

We've released the fix, and the GCInspector warnings dropped from ~100 per day per node to ~1 per day per 30 nodes :)
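
For anyone who runs into the same thing, here is a minimal sketch of one way to keep such a query bounded (not necessarily what our fix does), assuming a pycassa (Thrift) client; the keyspace/CF names, the host and the chunk size are illustrative, not our actual code:

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

def chunked_multiget_count(cf, keys, chunk_size=512):
    """Count columns per key in small batches instead of one ~40k-key request."""
    counts = {}
    for i in range(0, len(keys), chunk_size):
        # Each call asks the cluster to count columns for only chunk_size
        # keys, so the work done per request stays bounded.
        counts.update(cf.multiget_count(keys[i:i + chunk_size]))
    return counts

pool = ConnectionPool('MyKeyspace', ['cassandra-host:9160'])
cf = ColumnFamily(pool, 'MyCF')
all_keys = ['key-%d' % i for i in range(40000)]  # stand-in for the app's key list
print(chunked_multiget_count(cf, all_keys))

Keeping each request small also makes it easy to throttle or retry per chunk instead of failing the whole 40k-key operation.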

Thank you very much!

Ivan

2012/11/19 Viktor Jevdokimov <Viktor.Jevdokimov@adform.com>

We’ve seen OOMs in situations where the OS was not properly prepared for production.

 

http://www.datastax.com/docs/1.1/install/recommended_settings
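
If it helps, a rough sketch of how one could sanity-check a couple of those settings from Python on a Linux node (the thresholds in the printed messages are only indicative; the linked page is the authoritative list):

import resource

def read_swappiness(path='/proc/sys/vm/swappiness'):
    with open(path) as f:
        return int(f.read().strip())

def check_os_settings():
    swappiness = read_swappiness()
    memlock_soft, _ = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    print('vm.swappiness = %d (swap should be disabled or close to 0)' % swappiness)
    print('memlock limit = %s (unlimited recommended so JNA can lock the heap in memory)'
          % ('unlimited' if memlock_soft == resource.RLIM_INFINITY else memlock_soft))
    print('open-files limit = %d (tens of thousands recommended)' % nofile_soft)

if __name__ == '__main__':
    check_os_settings()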


Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer

J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
Follow us on Twitter: @adforminsider
Take a ride with Adform's Rich Media Suite


From: some.unique.login@gmail.com [mailto:some.unique.login@gmail.com] On Behalf Of a Coe
Sent: Saturday, November 17, 2012 08:08
To: user@cassandra.apache.org
Subject: Cassandra nodes failing with OOM

 

Dear Community,

we could use your advice.

 

We have a cluster in which 1/6 of the nodes have died for various reasons (3 of them with OOM messages).

Nodes died in groups of 3, 1, and 2. No adjacent nodes died, even though we use SimpleSnitch.

 

Version:        1.1.6
Hardware:       12 GB RAM / 8 cores (virtual)
Data:           40 GB/node
Nodes:          36 nodes

Keyspaces:      2 (RF=3, R=W=2) + 1 (OpsCenter)
CFs:            36, 2 indexes
Partitioner:    Random
Compaction:     Leveled (we don't want 2x space for housekeeping)
Caching:        Keys only

 

Everything is pretty much standard, apart from one CF that receives writes in 64K chunks and has sstable_size_in_mb=100.

No JNA installed - this is to be fixed soon.

 

Checking sysstat/sar, I can see 80-90% CPU idle and no anomalies in I/O; the only change is a spike in network activity.

Before dying, all the nodes had the following in their logs:

> INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher               1         4         0

> INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter                       1         3         0

> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff                     1         6         0

> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager                 5         9

 

GCInspector warnings were there too; heap usage went from ~0.8 GB to 3 GB in 5-10 minutes.
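
For reference, a quick sketch of how the warning rate can be counted per hour from the node's system.log (assuming the default log layout visible in the lines above; the log path is passed on the command line):

import re
import sys
from collections import Counter

HOUR = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}')

def gc_warnings_per_hour(path):
    per_hour = Counter()
    with open(path) as log:
        for line in log:
            if 'GCInspector' not in line:
                continue
            m = HOUR.search(line)
            if m:
                per_hour[m.group(1)] += 1  # bucket by "YYYY-MM-DD HH"
    return per_hour

if __name__ == '__main__':
    for hour, count in sorted(gc_warnings_per_hour(sys.argv[1]).items()):
        print('%s:00  %d' % (hour, count))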

 

So, could you please give me a hint on:

1. How many GCInspector warnings per hour are considered 'normal'?

2. What should be the next thing to check?

3. What are the possible failure reasons and how to prevent those?

 

Thank you very much in advance,

Ivan