cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-9549) Memory leak
Date Thu, 04 Jun 2015 19:43:38 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573456#comment-14573456
] 

Benedict commented on CASSANDRA-9549:
-------------------------------------

It's possible there is a script in their environment running periodically, asking the servers
to drain. There are really very few ways for that executor service to be shut down (assuming
it's the executor submitted to inside the method throwing the REE; it's hard to say with
absolute certainty because the stack trace has been compressed due to the frequency of the
error generation): the shutdown hook indicating the VM is terminating, or the drain() command.
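To illustrate the failure mode (a minimal sketch only, not Cassandra's DebuggableScheduledThreadPoolExecutor;
the class and task names below are made up): once shutdown() has been called on a
ScheduledThreadPoolExecutor, any later submission is rejected with exactly the kind of
RejectedExecutionException seen in the logs.

{code}
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectedAfterShutdownSketch {
    public static void main(String[] args) {
        // Stand-in for the scheduled executor the tidier submits to; once it has
        // been shut down (e.g. by the JVM shutdown hook or by drain()), it
        // terminates and refuses new work.
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        executor.shutdown();

        try {
            executor.schedule(() -> System.out.println("tidy"), 1, TimeUnit.SECONDS);
        } catch (RejectedExecutionException ree) {
            // Same exception type that is flooding the logs in this report.
            System.out.println("rejected: " + ree);
        }
    }
}
{code}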

As I said, though: more info means we can say with greater certainty. That full log history
since restart would be a great start. A thread dump would be the natural follow-on if that
was not sufficiently helpful.
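If attaching jstack output is awkward, the same information can be pulled via the standard
ThreadMXBean API. A minimal sketch (the class name is illustrative; this is plain JDK code,
not a Cassandra utility):

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        // Roughly the same data as `jstack <pid>` or `kill -3 <pid>`: every
        // thread's name, state, held locks and a (truncated) stack trace.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info); // ThreadInfo.toString() limits the stack depth it prints
        }
    }
}
{code}

As written this dumps the calling JVM's own threads; to inspect the Cassandra process it would
need to go over a remote JMX connection (port 7199 per the environment below), or just use
jstack / kill -3 against the pid.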

> Memory leak 
> ------------
>
>                 Key: CASSANDRA-9549
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.1.5. 9 node cluster in EC2 (m1.large nodes, 2 cores
7.5G memory, 800G platter for cassandra data, root partition and commit log are on SSD EBS
with sufficient IOPS), 3 nodes/availability zone, 1 replica/zone
> JVM: /usr/java/jdk1.8.0_40/jre/bin/java 
> JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled
-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M -XX:+HeapDumpOnOutOfMemoryError
-Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler -XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000
-XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
-Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir=
-Dcassandra-pidfile=/var/run/cassandra/cassandra.pid 
> Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Ivar Thorson
>            Priority: Critical
>             Fix For: 2.1.x
>
>         Attachments: cassandra.yaml, cpu-load.png, memoryuse.png, suspect.png, two-loads.png
>
>
> We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over the period
of a couple of days, eventually consumes all of the available JVM heap space, putting the
JVM into GC hell where it keeps trying CMS collection but can't free up any heap space. This
pattern happens for every node in our cluster and is requiring rolling cassandra restarts
just to keep the cluster running. We upgraded the cluster from the 2.0 branch per the Datastax
docs a couple of months ago and have been using the data from this cluster for more than a
year without problems.
> As the heap fills up with non-GC-able objects, the CPU/OS load average grows along with
it. Heap dumps reveal an increasing number of java.util.concurrent.ConcurrentLinkedQueue$Node
objects. We took heap dumps over a 2 day period, and watched the number of Node objects go
from 4M, to 19M, to 36M, and eventually about 65M objects before the node stops responding.
The screen capture of our heap dump is from the 19M measurement.
> Load on the cluster is minimal. We can see this effect even with only a handful of writes
per second. (See attachments for Opscenter snapshots during very light loads and heavier loads).
Even with only 5 reads a sec we see this behavior.
> Log files show repeated errors at Ref.java:181 and Ref.java:279, and "LEAK DETECTED" messages:
> {code}
> ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error when closing
class org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31
rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]
> {code}
> {code}
> ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151
was not released before the reference was garbage collected
> {code}
> This might be related to [CASSANDRA-8723]?
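For context on the LEAK DETECTED lines quoted above: they come from a reaper thread noticing
that a tracked resource was garbage collected without its reference ever being released. The
sketch below shows only the general phantom-reference pattern; it is not Cassandra's
Ref/Reference-Reaper code, and the names (LeakDetectorSketch, "sstable-reader-1150") are
invented for illustration.

{code}
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class LeakDetectorSketch {
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();

    // Tracks one resource; release() must be called before the resource
    // becomes unreachable, otherwise the reaper reports a leak.
    static class State extends PhantomReference<Object> {
        volatile boolean released;
        final String name;

        State(Object referent, String name) {
            super(referent, QUEUE);
            this.name = name;
        }

        void release() {
            released = true;
            clear();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Object resource = new Object();
        State state = new State(resource, "sstable-reader-1150");
        resource = null;                     // dropped without calling release()

        System.gc();                         // encourage collection (not guaranteed)
        Reference<?> collected = QUEUE.remove(5000);
        if (collected == state && !state.released) {
            // Cassandra's Reference-Reaper logs at this point (Ref.java:181 in 2.1.5).
            System.out.println("LEAK DETECTED: " + state.name
                    + " was not released before the reference was garbage collected");
        }
    }
}
{code}

The two symptoms may fit together: if the tidier's cleanup task is rejected by a terminated
executor, the resource may never be released, which would later surface as exactly this kind
of leak report.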



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
