Date: Thu, 4 Jun 2015 19:03:38 +0000 (UTC)
From: "Benedict (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-9549) Memory leak

    [ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573400#comment-14573400 ]

Benedict commented on CASSANDRA-9549:
-------------------------------------

Looks like you've called drain(), but the server is still up and trying to do work. A full system log (going back to node startup) would help, but this situation should be pretty atypical. Restarting the node should be enough to correct it.
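For anyone else who lands on this stack trace: the RejectedExecutionException quoted below is what any task submitted after an executor has shut down looks like. A minimal standalone sketch (plain JDK classes only, not Cassandra's DebuggableScheduledThreadPoolExecutor) that reproduces the same failure shape:

    import java.util.concurrent.Executors;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class RejectAfterShutdown
    {
        public static void main(String[] args)
        {
            ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
            executor.shutdown(); // roughly analogous to drain() terminating internal executors

            try
            {
                // Any submission after shutdown is rejected, matching the
                // "Task ... rejected from ...[Terminated, ...]" error below.
                executor.schedule(() -> {}, 1, TimeUnit.SECONDS);
            }
            catch (RejectedExecutionException e)
            {
                System.err.println("rejected, as expected: " + e);
            }
        }
    }

Once the pool has terminated, every late submission fails this way, which is why the errors repeat until the node is restarted.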
> Memory leak
> ------------
>
>                 Key: CASSANDRA-9549
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.1.5.
> 9 node cluster in EC2 (m1.large nodes, 2 cores, 7.5G memory, 800G platter for cassandra data; root partition and commit log are on SSD EBS with sufficient IOPS), 3 nodes/availability zone, 1 replica/zone
> JVM: /usr/java/jdk1.8.0_40/jre/bin/java
> JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
> Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Ivar Thorson
>            Priority: Critical
>             Fix For: 2.1.x
>
>         Attachments: cassandra.yaml, cpu-load.png, memoryuse.png, suspect.png, two-loads.png
>
>
> We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over a period of a couple of days, eventually consumes all of the available JVM heap space, putting the JVM into GC hell where it keeps attempting CMS collections but can't free up any heap space. This pattern happens on every node in our cluster and requires rolling Cassandra restarts just to keep the cluster running. We upgraded the cluster from the 2.0 branch per the Datastax docs a couple of months ago, and have been using the data from this cluster for more than a year without problems.
> As the heap fills up with non-GC-able objects, the CPU/OS load average grows along with it. Heap dumps reveal an increasing number of java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps over a 2-day period and watched the number of Node objects grow from 4M, to 19M, to 36M, and eventually to about 65M before the node stopped responding. The screen capture of our heap dump is from the 19M measurement.
> Load on the cluster is minimal. We can see this effect even with only a handful of writes per second (see attachments for OpsCenter snapshots during very light and heavier loads). Even with only 5 reads a second we see this behavior.
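(Aside on the heap signature above: a ConcurrentLinkedQueue only unlinks its internal $Node objects as entries are removed, so a queue that keeps being appended to after its consumer has stopped shows exactly this unbounded Node growth. A generic illustration of that pattern, assuming nothing about where the queue lives inside Cassandra:)

    import java.util.concurrent.ConcurrentLinkedQueue;

    public class NodeGrowth
    {
        public static void main(String[] args)
        {
            ConcurrentLinkedQueue<Object> queue = new ConcurrentLinkedQueue<>();
            long added = 0;

            // Entries are added but never polled, so every add() pins another
            // ConcurrentLinkedQueue$Node in the heap -- the same object type
            // that dominated the heap dumps described above. Run with a small
            // -Xmx and watch these nodes fill the heap until OutOfMemoryError.
            while (true)
            {
                queue.add(new Object());
                if (++added % 1_000_000 == 0)
                    System.out.println("nodes retained: " + added);
            }
        }
    }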
> Log files show repeated errors at Ref.java:181 and Ref.java:279, and "LEAK DETECTED" messages:
> ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error when closing class org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]
> ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before the reference was garbage collected
> This might be related to [CASSANDRA-8723]?
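The "LEAK DETECTED" line itself comes from Cassandra's Reference-Reaper thread, which tracks Ref objects and reports any whose referent was garbage collected before release() was called. A stripped-down sketch of that general detection technique, using plain java.lang.ref primitives rather than Cassandra's actual Ref implementation (the class and field names here are made up for illustration):

    import java.lang.ref.PhantomReference;
    import java.lang.ref.ReferenceQueue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class LeakReaper
    {
        static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
        // Tidy objects must stay strongly reachable until released, or the
        // PhantomReference itself could be collected before being enqueued.
        static final Set<Tidy> LIVE = ConcurrentHashMap.newKeySet();

        static final class Tidy extends PhantomReference<Object>
        {
            volatile boolean released;

            Tidy(Object referent)
            {
                super(referent, QUEUE);
                LIVE.add(this);
            }

            void release()
            {
                released = true;
                LIVE.remove(this);
            }
        }

        public static void main(String[] args) throws Exception
        {
            Thread reaper = new Thread(() -> {
                while (true)
                {
                    try
                    {
                        Tidy tidy = (Tidy) QUEUE.remove();
                        // Referent was GC'd; if release() was never called,
                        // that is a leak -- the same idea behind Ref.java's
                        // "LEAK DETECTED" report.
                        if (!tidy.released)
                            System.err.println("LEAK DETECTED: " + tidy + " was not released");
                        LIVE.remove(tidy);
                    }
                    catch (InterruptedException e)
                    {
                        return;
                    }
                }
            }, "Reference-Reaper");
            reaper.setDaemon(true);
            reaper.start();

            new Tidy(new Object()); // referent dropped without release()
            System.gc();
            Thread.sleep(500);      // give the reaper time to report the leak
        }
    }

If memory serves, the real Ref implementation can additionally capture creation stack traces (at some cost) when ref-count debugging is enabled, which makes these reports easier to act on.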