cassandra-user mailing list archives

From Viktor Jevdokimov <>
Subject RE: Cassandra nodes failing with OOM
Date Mon, 19 Nov 2012 12:59:54 GMT
We've seen OOM in situations where the OS was not properly prepared for production.

Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer

Phone: +370 5 212 3063, Fax +370 5 261 0453
J. Jasinskio 16C, LT-01112 Vilnius, Lithuania

From: [] On Behalf Of Ивaн
Sent: Saturday, November 17, 2012 08:08
Subject: Cassandra nodes failing with OOM

Dear Community,

Your advice is needed.

We have a cluster in which 1/6 of the nodes died for various reasons (3 logged an OOM message).
The nodes died in groups of 3, 1, and 2. No adjacent nodes died, even though we use SimpleSnitch.

Version:      1.1.6
Hardware:     12 GB RAM / 8 cores (virtual)
Data:         40 GB/node
Nodes:        36 nodes

Keyspaces:    2 (RF=3, R=W=2) + 1 (OpsCenter)
CFs:          36, 2 indexes
Partitioner:  Random
Compaction:   Leveled (we don't want 2x space for housekeeping)
Caching:      Keys only

Everything is pretty much standard, apart from one CF receiving writes in 64K chunks and
JNA not being installed (this is to be fixed soon).

Checking sysstat/sar, I can see 80-90% CPU idle and no anomalies in I/O; the only change is
a spike in network activity.
Before dying, all the nodes had the following in their logs:
> INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 (line 72) MemtablePostFlusher   1   4   0
> INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 (line 72) FlushWriter           1   3   0
> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 (line 72) HintedHandoff         1   6   0
> INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 (line 77) CompactionManager     5   9
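
The thread pool entries above can be scanned for backlogs with a small script. This is only a rough sketch: it assumes each log entry sits on a single line, and that the numeric columns are active, pending and blocked task counts (the post does not label them). The names parse_status_line and pools_with_backlog are made up for illustration.

```python
import re

# Matches "(line NN) <PoolName> <n> <n> [<n>]" at the end of a log line.
# Assumption: the columns are active, pending, blocked (blocked may be absent).
STATUS_RE = re.compile(
    r"\(line \d+\)\s+(?P<pool>\w+)\s+(?P<nums>\d+(?:\s+\d+){1,2})\s*$"
)


def parse_status_line(line):
    """Return (pool_name, [counts...]) or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    counts = [int(n) for n in match.group("nums").split()]
    return match.group("pool"), counts


def pools_with_backlog(lines, threshold=1):
    """Map pool name -> pending count for pools at or above the threshold."""
    backlog = {}
    for line in lines:
        parsed = parse_status_line(line)
        if parsed is None:
            continue
        pool, counts = parsed
        pending = counts[1]  # second column, assumed to be "pending"
        if pending >= threshold:
            backlog[pool] = pending
    return backlog


# The four entries quoted in the post, joined onto single lines.
SAMPLE = [
    "INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 (line 72) MemtablePostFlusher 1 4 0",
    "INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 (line 72) FlushWriter 1 3 0",
    "INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 (line 72) HintedHandoff 1 6 0",
    "INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 (line 77) CompactionManager 5 9",
]

if __name__ == "__main__":
    for pool, pending in pools_with_backlog(SAMPLE).items():
        print(pool, pending)
```

Run over a full system log, this would show which pools keep accumulating pending work in the minutes before a node dies.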

There were GCInspector warnings too; heap usage went from ~0.8 to 3 GB in 5-10 minutes.

So, could you please give me a hint on:
1. How many GCInspector warnings per hour are considered 'normal'?
2. What should be the next thing to check?
3. What are the possible failure reasons and how to prevent those?

Thank you very much in advance,
