Date: Thu, 4 Jun 2015 19:03:38 +0000 (UTC)
From: "Benedict (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-9549) Memory leak

    [ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573400#comment-14573400 ]

Benedict commented on CASSANDRA-9549:
-------------------------------------

Looks like you've called drain(), but the server is still up and trying to do work. A full system log (going back to node startup) would help, but this situation should be pretty atypical. Restarting the node should be enough to correct it.
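For anyone else who lands on this stack trace: the RejectedExecutionException quoted below is what any task submitted after an executor has shut down looks like. A minimal standalone sketch (plain JDK classes only, not Cassandra's DebuggableScheduledThreadPoolExecutor) that reproduces the same failure shape:

    import java.util.concurrent.Executors;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class RejectAfterShutdown
    {
        public static void main(String[] args)
        {
            ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
            executor.shutdown(); // roughly analogous to drain() terminating internal executors

            try
            {
                // Any submission after shutdown is rejected, matching the
                // "Task ... rejected from ...[Terminated, ...]" error below.
                executor.schedule(() -> {}, 1, TimeUnit.SECONDS);
            }
            catch (RejectedExecutionException e)
            {
                System.err.println("rejected, as expected: " + e);
            }
        }
    }

Once the pool has terminated, every late submission fails this way, which is why the errors repeat until the node is restarted.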
> Memory leak
> ------------
>
>                 Key: CASSANDRA-9549
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.1.5.
> 9 node cluster in EC2 (m1.large nodes, 2 cores, 7.5G memory, 800G platter for cassandra data; root partition and commit log are on SSD EBS with sufficient IOPS), 3 nodes/availability zone, 1 replica/zone
> JVM: /usr/java/jdk1.8.0_40/jre/bin/java
> JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
> Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Ivar Thorson
>            Priority: Critical
>             Fix For: 2.1.x
>
>         Attachments: cassandra.yaml, cpu-load.png, memoryuse.png, suspect.png, two-loads.png
>
>
> We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over a period of a couple of days, eventually consumes all of the available JVM heap space, putting the JVM into GC hell where it keeps attempting CMS collections but can't free up any heap space. This pattern happens on every node in our cluster and requires rolling Cassandra restarts just to keep the cluster running. We upgraded the cluster from the 2.0 branch per the Datastax docs a couple of months ago, and have been using the data from this cluster for more than a year without problems.
> As the heap fills up with non-GC-able objects, the CPU/OS load average grows along with it. Heap dumps reveal an increasing number of java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps over a 2-day period and watched the number of Node objects grow from 4M, to 19M, to 36M, and eventually to about 65M before the node stopped responding. The screen capture of our heap dump is from the 19M measurement.
> Load on the cluster is minimal. We can see this effect even with only a handful of writes per second (see attachments for OpsCenter snapshots during very light and heavier loads). Even with only 5 reads a second we see this behavior.
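(Aside on the heap signature above: a ConcurrentLinkedQueue only unlinks its internal $Node objects as entries are removed, so a queue that keeps being appended to after its consumer has stopped shows exactly this unbounded Node growth. A generic illustration of that pattern, assuming nothing about where the queue lives inside Cassandra:)

    import java.util.concurrent.ConcurrentLinkedQueue;

    public class NodeGrowth
    {
        public static void main(String[] args)
        {
            ConcurrentLinkedQueue<Object> queue = new ConcurrentLinkedQueue<>();
            long added = 0;

            // Entries are added but never polled, so every add() pins another
            // ConcurrentLinkedQueue$Node in the heap -- the same object type
            // that dominated the heap dumps described above. Run with a small
            // -Xmx and watch these nodes fill the heap until OutOfMemoryError.
            while (true)
            {
                queue.add(new Object());
                if (++added % 1_000_000 == 0)
                    System.out.println("nodes retained: " + added);
            }
        }
    }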
> Log files show repeated errors at Ref.java:181 and Ref.java:279, and "LEAK DETECTED" messages:
> ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error when closing class org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]
> ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before the reference was garbage collected
> This might be related to [CASSANDRA-8723]?
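The "LEAK DETECTED" line itself comes from Cassandra's Reference-Reaper thread, which tracks Ref objects and reports any whose referent was garbage collected before release() was called. A stripped-down sketch of that general detection technique, using plain java.lang.ref primitives rather than Cassandra's actual Ref implementation (the class and field names here are made up for illustration):

    import java.lang.ref.PhantomReference;
    import java.lang.ref.ReferenceQueue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class LeakReaper
    {
        static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
        // Tidy objects must stay strongly reachable until released, or the
        // PhantomReference itself could be collected before being enqueued.
        static final Set<Tidy> LIVE = ConcurrentHashMap.newKeySet();

        static final class Tidy extends PhantomReference<Object>
        {
            volatile boolean released;

            Tidy(Object referent)
            {
                super(referent, QUEUE);
                LIVE.add(this);
            }

            void release()
            {
                released = true;
                LIVE.remove(this);
            }
        }

        public static void main(String[] args) throws Exception
        {
            Thread reaper = new Thread(() -> {
                while (true)
                {
                    try
                    {
                        Tidy tidy = (Tidy) QUEUE.remove();
                        // Referent was GC'd; if release() was never called,
                        // that is a leak -- the same idea behind Ref.java's
                        // "LEAK DETECTED" report.
                        if (!tidy.released)
                            System.err.println("LEAK DETECTED: " + tidy + " was not released");
                        LIVE.remove(tidy);
                    }
                    catch (InterruptedException e)
                    {
                        return;
                    }
                }
            }, "Reference-Reaper");
            reaper.setDaemon(true);
            reaper.start();

            new Tidy(new Object()); // referent dropped without release()
            System.gc();
            Thread.sleep(500);      // give the reaper time to report the leak
        }
    }

If memory serves, the real Ref implementation can additionally capture creation stack traces (at some cost) when ref-count debugging is enabled, which makes these reports easier to act on.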