incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Cassandra nodes failing with OOM
Date Sun, 18 Nov 2012 19:38:59 GMT
> 1. How much GCInspector warnings per hour are considered 'normal'?
None. 
A couple during compaction or repair is not the end of the world. But if you have enough to think about a "per hour" rate, it's too many.

> 2. What should be the next thing to check?
Try to determine if the GC activity correlates to application workload, compaction or repair.


Try to determine what the working set of the server is. Watch the GC activity (via GC logs or JMX) and see what the size of the tenured heap is after a CMS collection. Or try to calculate it: http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html
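To make the "tenured heap after a CMS" check concrete, here is a minimal sketch. The flags are standard HotSpot GC-logging options and `jstat` ships with the JDK; the `cassandra-env.sh` path and log location are assumptions you should adjust for your install.

```shell
# Sketch only: enable GC logging by adding standard HotSpot flags
# to conf/cassandra-env.sh (path assumed; adjust for your install).
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"

# Alternatively, watch live via jstat: OU is old-gen occupancy (%).
# The value it settles to after each CMS cycle approximates the
# working set. <pid> is the Cassandra process id; samples every 10s.
jstat -gcutil <pid> 10s
```

If the old gen stays near capacity even right after a CMS cycle completes, the working set simply does not fit in the heap, and no amount of GC tuning will fix that.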

Look at your data model and query patterns for places where very large queries are being made, or for rows that are very long lived with a lot of deletes (probably not as much of an issue with LDB).
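A quick way to look for problem rows is with the stock nodetool commands below. The keyspace and column family names are placeholders; the stats of interest are "Compacted row maximum size" from cfstats and the row-size distribution from cfhistograms.

```shell
# Sketch: spotting unusually wide rows. "MyKeyspace" / "MyCF" are
# placeholders for your own names.

# Per-CF stats include "Compacted row maximum size" (bytes).
nodetool -h localhost cfstats

# Row size and column count distribution for one column family.
nodetool -h localhost cfhistograms MyKeyspace MyCF
```

A handful of multi-hundred-MB rows showing up in the histogram is a common cause of sudden tenured-heap growth during reads and compaction.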
 

> 3. What are the possible failure reasons and how to prevent those?

As above. 
As a workaround, sometimes drastically slowing down compaction can help. For LDB, try reducing in_memory_compaction_limit_in_mb and compaction_throughput_mb_per_sec.
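For example (a sketch, not a tuned recommendation: the values below assume the 1.1-era defaults of 16 MB/s throughput and a 64 MB in-memory compaction limit):

```shell
# Throughput can be changed live; setcompactionthroughput is a
# stock nodetool command, value in MB/s (0 disables throttling).
nodetool -h localhost setcompactionthroughput 4

# in_memory_compaction_limit_in_mb is set in conf/cassandra.yaml
# and needs a restart to take effect, e.g.:
#   in_memory_compaction_limit_in_mb: 32
```

Lowering both reduces the amount of garbage compaction generates per unit time, at the cost of compactions falling further behind.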


Hope that helps. 

 
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/11/2012, at 7:07 PM, Ivan Sobolev <soboleiv@gmail.com> wrote:

> Dear Community, 
> 
> we need your advice.
> 
> We have a cluster, 1/6 of the nodes of which died for various reasons (3 had OOM messages).
> Nodes died in groups of 3, 1, 2. No adjacent died, though we use SimpleSnitch.
> 
> Version:         1.1.6
> Hardware:      12Gb RAM / 8 cores(virtual)
> Data:              40Gb/node
> Nodes:           36 nodes
> 
> Keyspaces:    2(RF=3, R=W=2) + 1(OpsCenter)
> CFs:                36, 2 indexes
> Partitioner:      Random
> Compaction:   Leveled(we don't want 2x space for housekeeping)
> Caching:          Keys only
> 
> All is pretty much standard apart from the one CF receiving writes in 64K chunks and having sstable_size_in_mb=100.
> No JNA installed - this is to be fixed soon.
> 
> Checking sysstat/sar I can see 80-90% CPU idle, no anomalies in io and the only change - network activity spiking.
> All the nodes before dying had the following in their logs:
> > INFO [ScheduledTasks:1] 2012-11-15 21:35:05,512 StatusLogger.java (line 72) MemtablePostFlusher               1         4         0
> > INFO [ScheduledTasks:1] 2012-11-15 21:35:13,540 StatusLogger.java (line 72) FlushWriter                       1         3         0
> > INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 72) HintedHandoff                     1         6         0
> > INFO [ScheduledTasks:1] 2012-11-15 21:36:32,162 StatusLogger.java (line 77) CompactionManager                 5         9
> 
> GCInspector warnings were there too; they went from ~0.8GB to 3GB of heap in 5-10 minutes.
> 
> So, could you please give me a hint on:
> 1. How much GCInspector warnings per hour are considered 'normal'?
> 2. What should be the next thing to check?
> 3. What are the possible failure reasons and how to prevent those?
> 
> Thank you very much in advance,
> Ivan

