Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Thu, 4 Jun 2015 20:56:39 +0000 (UTC)
From: "Benedict (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12835471.1433439713000.15801.1433451399564@Atlassian.JIRA>
In-Reply-To: <JIRA.12835471.1433439713000@Atlassian.JIRA>
References: <JIRA.12835471.1433439713000@Atlassian.JIRA>
 <JIRA.12835471.1433439713859@arcas>
Subject: [jira] [Commented] (CASSANDRA-9549) Memory leak
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/CASSANDRA-9549?page=3Dcom.atlas=
sian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D=
14573567#comment-14573567 ]=20

Benedict commented on CASSANDRA-9549:
-------------------------------------

Thanks. Unfortunately that does not seem to be the complete log history. It=
 would help a great deal to have logs from when the node actually started u=
p.

I can make an educated guess, though: it looks like the node was OOMing due=
 to normal operational reasons (or perhaps some other issue, we cannot say)=
, and we recently modified behaviour in this scenario to trigger a shutdown=
 of the host. Unfortunately, it seems that the OOM is somehow delaying the =
shutdown from completing, or perhaps there is some other issue. Certainly t=
he JVM thinks it is shutting down.

The strange thing is that the shutdown hook must still have been run, since=
 that is the only way the executor service could be shutdown, only we ask t=
he shutdown hook to be removed in this event.

More complete logs would help us.

Increasing your heap space may fix the underlying problem. It may be that t=
here is another underlying issue causing your heap to explode. To establish=
 this we would need a heap dump during one of these events. If, however, yo=
u make extensive use of CQL row deletions, or CQL collections and perform o=
verwrites of the entire collection, it may be that you are encountering CAS=
SANDRA-9486, in which case a patch is available for that, and will be fixed=
 in 2.1.6 to be released shortly.

> Memory leak=20
> ------------
>
>                 Key: CASSANDRA-9549
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9549
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.1.5. 9 node cluster in EC2 (m1.large nod=
es, 2 cores 7.5G memory, 800G platter for cassandra data, root partition an=
d commit log are on SSD EBS with sufficient IOPS), 3 nodes/availablity zone=
, 1 replica/zone
> JVM: /usr/java/jdk1.8.0_40/jre/bin/java=20
> JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.=
jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriori=
tyPolicy=3D42 -Xms2G -Xmx2G -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss25=
6k -XX:StringTableSize=3D1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -=
XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=3D8 -XX:MaxTenuringThreshold=
=3D1 -XX:CMSInitiatingOccupancyFraction=3D75 -XX:+UseCMSInitiatingOccupancy=
Only -XX:+UseTLAB -XX:CompileCommandFile=3D/etc/cassandra/conf/hotspot_comp=
iler -XX:CMSWaitDuration=3D10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CM=
SEdenChunksRecordAlways -XX:CMSWaitDuration=3D10000 -XX:+UseCondCardMark -D=
java.net.preferIPv4Stack=3Dtrue -Dcom.sun.management.jmxremote.port=3D7199 =
-Dcom.sun.management.jmxremote.rmi.port=3D7199 -Dcom.sun.management.jmxremo=
te.ssl=3Dfalse -Dcom.sun.management.jmxremote.authenticate=3Dfalse -Dlogbac=
k.configurationFile=3Dlogback.xml -Dcassandra.logdir=3D/var/log/cassandra -=
Dcassandra.storagedir=3D -Dcassandra-pidfile=3D/var/run/cassandra/cassandra=
.pid=20
> Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/=
Linux
>            Reporter: Ivar Thorson
>            Priority: Critical
>             Fix For: 2.1.x
>
>         Attachments: c4_system.log, cassandra.yaml, cpu-load.png, memoryu=
se.png, ref-java-errors.jpeg, suspect.png, two-loads.png
>
>
> We have been experiencing a severe memory leak with Cassandra 2.1.5 that,=
 over the period of a couple of days, eventually consumes all of the availa=
ble JVM heap space, putting the JVM into GC hell where it keeps trying CMS =
collection but can't free up any heap space. This pattern happens for every=
 node in our cluster and is requiring rolling cassandra restarts just to ke=
ep the cluster running. We have upgraded the cluster per Datastax docs from=
 the 2.0 branch a couple of months ago and have been using the data from th=
is cluster for more than a year without problem.
> As the heap fills up with non-GC-able objects, the CPU/OS load average gr=
ows along with it. Heap dumps reveal an increasing number of java.util.conc=
urrent.ConcurrentLinkedQueue$Node objects. We took heap dumps over a 2 day =
period, and watched the number of Node objects go from 4M, to 19M, to 36M, =
and eventually about 65M objects before the node stops responding. The scre=
en capture of our heap dump is from the 19M measurement.
> Load on the cluster is minimal. We can see this effect even with only a h=
andful of writes per second. (See attachments for Opscenter snapshots durin=
g very light loads and heavier loads). Even with only 5 reads a sec we see =
this behavior.
> Log files show repeated errors in Ref.java:181 and Ref.java:279 and "LEAK=
 detected" messages:
> {code}
> ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Err=
or when closing class org.apache.cassandra.io.sstable.SSTableReader$Instanc=
eTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
> java.util.concurrent.RejectedExecutionException: Task java.util.concurren=
t.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from or=
g.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d=
6[Terminated, pool size =3D 0, active threads =3D 0, queued tasks =3D 0, co=
mpleted tasks =3D 1644]
> {code}
> {code}
> ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DE=
TECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df=
92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTi=
dy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before =
the reference was garbage collected
> {code}
> This might be related to [CASSANDRA-8723]?


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)