cassandra-commits mailing list archives

From "Russ Hatch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8285) OOME in Cassandra 2.0.11
Date Wed, 12 Nov 2014 20:40:56 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208642#comment-14208642 ]

Russ Hatch commented on CASSANDRA-8285:
---------------------------------------

I used MAT to have a look at the heap contents and try to figure out what's going on. The
heap appears to be dominated by a single MemTable consuming about 97% of heap. The max memtable
size is definitely being exceeded. So something quirky with flushing seems like a good suspect.

Here's the startup message acknowledging the max size (presumably defaulted to 1/4 heap by
not being set):
{noformat}
 INFO [main] 2014-11-10 17:09:29,348 DatabaseDescriptor.java (line 268) Global memtable threshold
is enabled at 253MB
{noformat}
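As a sanity check on that presumption, the numbers line up; here is a quick back-of-the-envelope sketch (the 1/4-of-heap fraction is my assumption from the log line, not confirmed from the config):

```python
# Back-calculate the heap size from the logged threshold, assuming the default
# really is 1/4 of heap (an assumption; the exact fraction may be version-dependent).
threshold_mb = 253                 # from the DatabaseDescriptor log line above
heap_mb = threshold_mb * 4         # implied max heap: ~1012 MB
memtable_mb = int(heap_mb * 0.97)  # MAT shows one MemTable at ~97% of heap
print(heap_mb, memtable_mb)        # 1012 981
```

So the single memtable seen in MAT would be close to 4x the configured threshold.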

Due to dropped messages and full GCs, StatusLogger was dumping some (maybe) useful information
to the logs that shows some blocked FlushWriter threads. I'm not sure whether this is routine
or not; it might be worth reviewing. Here we see the FlushWriter 'All Time Blocked' value
spike suddenly, followed by a large increase in pending compactions (if I'm interpreting the
logs correctly):
{noformat}
                Active    Pending    Completed      Blocked     All Time Blocked
 INFO [ScheduledTasks:1] 2014-11-11 08:21:07,549 StatusLogger.java (line 70) FlushWriter 
                     0         0           4993         0                 1
 INFO [ScheduledTasks:1] 2014-11-11 08:21:07,636 StatusLogger.java (line 70) CompactionExecutor
               1         6           4639         0                 0
 INFO [ScheduledTasks:1] 2014-11-11 08:21:07,636 StatusLogger.java (line 79) CompactionManager
                1         8
 INFO [ScheduledTasks:1] 2014-11-12 09:57:10,590 StatusLogger.java (line 70) FlushWriter 
                     0         0            111         0                28
 INFO [ScheduledTasks:1] 2014-11-12 09:57:10,719 StatusLogger.java (line 70) CompactionExecutor
               1        19             26         0                 0
 INFO [ScheduledTasks:1] 2014-11-12 09:57:10,720 StatusLogger.java (line 79) CompactionManager
                1        21
 INFO [ScheduledTasks:1] 2014-11-12 09:57:24,648 StatusLogger.java (line 70) FlushWriter 
                     0         0            111         0                28
{noformat}
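To spot this programmatically across a long log, something like the following would work; a minimal sketch assuming the two-line StatusLogger layout above (a "... StatusLogger.java (line 70) FlushWriter" header line, then a stats line with Active/Pending/Completed/Blocked/All Time Blocked):

```python
# Sketch: pair each FlushWriter header line with the stats line that follows it
# and collect the last column (All Time Blocked) so jumps stand out.
log = """\
INFO [ScheduledTasks:1] 2014-11-11 08:21:07,549 StatusLogger.java (line 70) FlushWriter
                     0         0           4993         0                 1
INFO [ScheduledTasks:1] 2014-11-12 09:57:10,590 StatusLogger.java (line 70) FlushWriter
                     0         0            111         0                28
"""

def all_time_blocked(log_text):
    values = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith("FlushWriter"):
            stats = lines[i + 1].split()  # Active Pending Completed Blocked ATB
            values.append(int(stats[-1]))
    return values

print(all_time_blocked(log))  # [1, 28] -- the spike between the two snapshots
```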

So it seems that, basically, an unflushed memtable is consuming excessive heap and causing the constant
full GCs and the eventual OOM. Now the question is why the memtable is not being flushed.
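A toy model of the suspected failure mode (this is an illustration of the hypothesis, not Cassandra code): if every flush attempt stalls because the flush executor is blocked, the memtable only ever grows, regardless of the configured threshold.

```python
# Toy simulation: writes accumulate in a memtable; a flush normally drains it
# once the threshold is crossed, but a blocked flush path lets it grow unbounded.
THRESHOLD_MB = 253  # the logged global memtable threshold

def apply_writes(n_writes, mb_per_write, flush_blocked):
    memtable_mb = 0
    for _ in range(n_writes):
        memtable_mb += mb_per_write
        if memtable_mb >= THRESHOLD_MB and not flush_blocked:
            memtable_mb = 0  # flush drains the memtable
    return memtable_mb

print(apply_writes(1000, 1, flush_blocked=False))  # stays under the threshold
print(apply_writes(1000, 1, flush_blocked=True))   # grows to 1000 MB
```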

> OOME in Cassandra 2.0.11
> ------------------------
>
>                 Key: CASSANDRA-8285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8285
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.0.11 + java-driver 2.0.8-SNAPSHOT
> Cassandra 2.0.11 + ruby-driver 1.0-beta
>            Reporter: Pierre Laporte
>            Assignee: Aleksey Yeschenko
>         Attachments: OOME_node_system.log, gc.log.gz, heap-usage-after-gc-zoom.png, heap-usage-after-gc.png
>
>
> We ran 3-day driver endurance tests against Cassandra 2.0.11 and C* crashed with an
> OOME.  This happened both with ruby-driver 1.0-beta and java-driver 2.0.8-snapshot.
> Attached are:
> | OOME_node_system.log | The system.log of one Cassandra node that crashed |
> | gc.log.gz | The GC log on the same node |
> | heap-usage-after-gc.png | The heap occupancy evolution after every GC cycle |
> | heap-usage-after-gc-zoom.png | A focus on when things start to go wrong |
> Workload:
> Our test executes 5 CQL statements (select, insert, select, delete, select) for a given
> unique id, during 3 days, using multiple threads.  There is no change in the workload during
> the test.
> Symptoms:
> In the attached log, it seems something starts in Cassandra between 2014-11-06 10:29:22
> and 2014-11-06 10:45:32.  This causes an allocation that fills the heap.  We eventually get
> stuck in a Full GC storm and get an OOME in the logs.
> I have run the java-driver tests against Cassandra 1.2.19 and 2.1.1.  The error does
> not occur.  It seems specific to 2.0.11.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
