From Ran Tavory <>
Date Fri, 21 May 2010 21:36:24 GMT
I'm seeing OOMs on one of the hosts in the cluster, and I wonder if there's a
formula that would help me calculate the required memory settings given
parameters x, y, z...

In short, I need advice on:
1. How to size the heap properly, and which parameters I should look at
when doing so.
2. How to set up an alert policy and define countermeasures or SOS steps
an admin can take to prevent further degradation of service when alerts
fire.

The OOM occurs in the row mutation stage, and it happens after extensive GC
activity (log tail below).

The server has 16G of physical RAM and a 4G Java heap. No other significant
processes run on the same server. I actually upped the Java heap to 8G,
but it OOMed again...

Most of my settings are the defaults, with a few keyspaces and a few CFs in
each KS. Here's the output of cfstats for the largest and most heavily used
CF (reads and writes are currently stopped, but the data is there).

Keyspace: outbrain_kvdb
        Read Count: 3392
        Read Latency: 160.33135908018866 ms.
        Write Count: 2005839
        Write Latency: 0.029233923061621595 ms.
        Pending Tasks: 0
                Column Family: KvImpressions
                SSTable count: 8
                Space used (live): 21923629878
                Space used (total): 21923629878
                Memtable Columns Count: 69440
                Memtable Data Size: 9719364
                Memtable Switch Count: 26
                Read Count: 3392
                Read Latency: NaN ms.
                Write Count: 1998821
                Write Latency: 0.018 ms.
                Pending Tasks: 0
                Key cache capacity: 200000
                Key cache size: 11661
                Key cache hit rate: NaN
                Row cache: disabled
                Compacted row minimum size: 302
                Compacted row maximum size: 22387
                Compacted row mean size: 641

I'm also attaching a few graphs of "the incident"; I hope they help. From
the graphs it looks like:
1. The message deserializer pool is falling behind, so it may be taking too
much memory. If the graphs are correct, it gets as high as 10m pending
before the crash.
2. row-read-stage has a high number of pending tasks (4k). First of all, this
isn't good for performance whether it caused the OOM or not; second, it
may also have taken up heap space and caused the crash.
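
A minimal sketch of how the two observations above could be turned into an
alert rule, assuming the pending counts per stage are already being collected
somewhere (for example from the same source as the graphs). The stage names,
the limits, and the function name are hypothetical and would need tuning
against real traffic:

```python
# Hypothetical per-stage limits on pending tasks; not recommended values.
PENDING_LIMITS = {
    "MESSAGE-DESERIALIZER-POOL": 100_000,
    "ROW-READ-STAGE": 1_000,
}

def stages_over_limit(pending, limits=PENDING_LIMITS):
    """Return the names of stages whose pending count exceeds its limit."""
    return sorted(
        stage for stage, count in pending.items()
        # Stages without a configured limit never trigger an alert.
        if count > limits.get(stage, float("inf"))
    )

# Pending counts read off the graphs before the crash (10m and 4k).
observed = {
    "MESSAGE-DESERIALIZER-POOL": 10_000_000,
    "ROW-READ-STAGE": 4_000,
}
print(stages_over_limit(observed))
```

With limits like these, both stages would have fired well before the crash,
which is the window in which an admin could still react.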


 INFO [GC inspection] 2010-05-21 00:53:25,885 (line 110) GC
for ConcurrentMarkSweep: 10819 ms, 939992 reclaimed leaving 4312064504 used;
max is 4431216640
 INFO [GC inspection] 2010-05-21 00:53:44,605 (line 110) GC
for ConcurrentMarkSweep: 9672 ms, 673400 reclaimed leaving 4312337208 used;
max is 4431216640
 INFO [GC inspection] 2010-05-21 00:54:23,110 (line 110) GC
for ConcurrentMarkSweep: 9150 ms, 402072 reclaimed leaving 4312609776 used;
max is 4431216640
ERROR [ROW-MUTATION-STAGE:19] 2010-05-21 01:55:37,951
(line 88) Fatal exception in thread Thread[ROW-MUTATION-STAGE:19,5,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [Thread-10] 2010-05-21 01:55:37,951 (line 88)
Fatal exception in thread Thread[Thread-10,5,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [CACHETABLE-TIMER-2] 2010-05-21 01:55:37,951
(line 88) Fatal exception in thread Thread[CACHETABLE-TIMER-2,5,main]
java.lang.OutOfMemoryError: Java heap space
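
The GC lines in the log tail above already carry enough information for a
simple early-warning check: a long ConcurrentMarkSweep run that reclaims
almost nothing while the heap sits near its maximum. The following is a rough
sketch, assuming the log format shown above; the function names and both
thresholds are hypothetical:

```python
import re

# Matches the "GC for <collector>: ... max is <n>" part of a GC inspection
# entry. \s+ also matches newlines, so entries wrapped across lines still parse.
GC_LINE = re.compile(
    r"GC\s+for\s+(?P<collector>\w+):\s+(?P<ms>\d+)\s+ms,\s+"
    r"(?P<reclaimed>\d+)\s+reclaimed\s+leaving\s+(?P<used>\d+)\s+used;\s+"
    r"max\s+is\s+(?P<max>\d+)"
)

def parse_gc_events(log_text):
    """Extract one dict per GC inspection entry found in log_text."""
    events = []
    for m in GC_LINE.finditer(log_text):
        heap_max = int(m["max"])
        events.append({
            "collector": m["collector"],
            "pause_ms": int(m["ms"]),
            "reclaimed": int(m["reclaimed"]),
            "heap_max": heap_max,
            "occupancy": int(m["used"]) / heap_max,
        })
    return events

def heap_is_exhausted(event, min_occupancy=0.9, max_reclaimed_fraction=0.01):
    """True if the heap is nearly full and the collection freed very little.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    reclaimed_fraction = event["reclaimed"] / event["heap_max"]
    return (event["occupancy"] >= min_occupancy
            and reclaimed_fraction <= max_reclaimed_fraction)

# First GC entry from the log tail above, with its original line wrapping.
sample = """
 INFO [GC inspection] 2010-05-21 00:53:25,885 (line 110) GC
for ConcurrentMarkSweep: 10819 ms, 939992 reclaimed leaving 4312064504 used;
max is 4431216640
"""

for event in parse_gc_events(sample):
    print(f"{event['collector']}: {event['occupancy']:.1%} used, "
          f"exhausted={heap_is_exhausted(event)}")
```

On the entries above this puts heap occupancy at roughly 97% with under 1 MB
reclaimed per collection, so a check like this would have flagged the node
about an hour before the first OutOfMemoryError.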

