Return-Path: X-Original-To: apmail-cassandra-commits-archive@www.apache.org Delivered-To: apmail-cassandra-commits-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D0F9D18CA9 for ; Thu, 4 Jun 2015 20:56:39 +0000 (UTC) Received: (qmail 79520 invoked by uid 500); 4 Jun 2015 20:56:39 -0000 Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org Received: (qmail 79490 invoked by uid 500); 4 Jun 2015 20:56:39 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 79467 invoked by uid 99); 4 Jun 2015 20:56:39 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 04 Jun 2015 20:56:39 +0000 Date: Thu, 4 Jun 2015 20:56:39 +0000 (UTC) From: "Benedict (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CASSANDRA-9549) Memory leak MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=3Dcom.atlas= sian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D= 14573567#comment-14573567 ]=20 Benedict commented on CASSANDRA-9549: ------------------------------------- Thanks. Unfortunately that does not seem to be the complete log history. It= would help a great deal to have logs from when the node actually started u= p. I can make an educated guess, though: it looks like the node was OOMing due= to normal operational reasons (or perhaps some other issue, we cannot say)= , and we recently modified behaviour in this scenario to trigger a shutdown= of the host. Unfortunately, it seems that the OOM is somehow delaying the = shutdown from completing, or perhaps there is some other issue. Certainly t= he JVM thinks it is shutting down. The strange thing is that the shutdown hook must still have been run, since= that is the only way the executor service could be shutdown, only we ask t= he shutdown hook to be removed in this event. More complete logs would help us. Increasing your heap space may fix the underlying problem. It may be that t= here is another underlying issue causing your heap to explode. To establish= this we would need a heap dump during one of these events. If, however, yo= u make extensive use of CQL row deletions, or CQL collections and perform o= verwrites of the entire collection, it may be that you are encountering CAS= SANDRA-9486, in which case a patch is available for that, and will be fixed= in 2.1.6 to be released shortly. > Memory leak=20 > ------------ > > Key: CASSANDRA-9549 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9549 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Cassandra 2.1.5. 9 node cluster in EC2 (m1.large nod= es, 2 cores 7.5G memory, 800G platter for cassandra data, root partition an= d commit log are on SSD EBS with sufficient IOPS), 3 nodes/availablity zone= , 1 replica/zone > JVM: /usr/java/jdk1.8.0_40/jre/bin/java=20 > JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.= jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriori= tyPolicy=3D42 -Xms2G -Xmx2G -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss25= 6k -XX:StringTableSize=3D1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -= XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=3D8 -XX:MaxTenuringThreshold= =3D1 -XX:CMSInitiatingOccupancyFraction=3D75 -XX:+UseCMSInitiatingOccupancy= Only -XX:+UseTLAB -XX:CompileCommandFile=3D/etc/cassandra/conf/hotspot_comp= iler -XX:CMSWaitDuration=3D10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CM= SEdenChunksRecordAlways -XX:CMSWaitDuration=3D10000 -XX:+UseCondCardMark -D= java.net.preferIPv4Stack=3Dtrue -Dcom.sun.management.jmxremote.port=3D7199 = -Dcom.sun.management.jmxremote.rmi.port=3D7199 -Dcom.sun.management.jmxremo= te.ssl=3Dfalse -Dcom.sun.management.jmxremote.authenticate=3Dfalse -Dlogbac= k.configurationFile=3Dlogback.xml -Dcassandra.logdir=3D/var/log/cassandra -= Dcassandra.storagedir=3D -Dcassandra-pidfile=3D/var/run/cassandra/cassandra= .pid=20 > Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/= Linux > Reporter: Ivar Thorson > Priority: Critical > Fix For: 2.1.x > > Attachments: c4_system.log, cassandra.yaml, cpu-load.png, memoryu= se.png, ref-java-errors.jpeg, suspect.png, two-loads.png > > > We have been experiencing a severe memory leak with Cassandra 2.1.5 that,= over the period of a couple of days, eventually consumes all of the availa= ble JVM heap space, putting the JVM into GC hell where it keeps trying CMS = collection but can't free up any heap space. This pattern happens for every= node in our cluster and is requiring rolling cassandra restarts just to ke= ep the cluster running. We have upgraded the cluster per Datastax docs from= the 2.0 branch a couple of months ago and have been using the data from th= is cluster for more than a year without problem. > As the heap fills up with non-GC-able objects, the CPU/OS load average gr= ows along with it. Heap dumps reveal an increasing number of java.util.conc= urrent.ConcurrentLinkedQueue$Node objects. We took heap dumps over a 2 day = period, and watched the number of Node objects go from 4M, to 19M, to 36M, = and eventually about 65M objects before the node stops responding. The scre= en capture of our heap dump is from the 19M measurement. > Load on the cluster is minimal. We can see this effect even with only a h= andful of writes per second. (See attachments for Opscenter snapshots durin= g very light loads and heavier loads). Even with only 5 reads a sec we see = this behavior. > Log files show repeated errors in Ref.java:181 and Ref.java:279 and "LEAK= detected" messages: > {code} > ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Err= or when closing class org.apache.cassandra.io.sstable.SSTableReader$Instanc= eTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150 > java.util.concurrent.RejectedExecutionException: Task java.util.concurren= t.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from or= g.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d= 6[Terminated, pool size =3D 0, active threads =3D 0, queued tasks =3D 0, co= mpleted tasks =3D 1644] > {code} > {code} > ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DE= TECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df= 92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTi= dy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before = the reference was garbage collected > {code} > This might be related to [CASSANDRA-8723]? -- This message was sent by Atlassian JIRA (v6.3.4#6332)