cassandra-commits mailing list archives

From "Jorge Rodriguez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10150) Cassandra read latency potentially caused by memory leak
Date Fri, 21 Aug 2015 14:09:46 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706772#comment-14706772 ]

Jorge Rodriguez commented on CASSANDRA-10150:
---------------------------------------------

We came across this thread from Benedict on the jmx-dev mailing list yesterday and implemented
the workaround he recommends there: http://mail.openjdk.java.net/pipermail/jmx-dev/2014-February/000585.html
The workaround is to enable the JVM flag "CMSClassUnloadingEnabled".

Since enabling this flag yesterday we are no longer seeing the memory leak. Performance does
not appear to have been impacted so far either.
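
For reference, enabling the flag typically amounts to a one-line change in conf/cassandra-env.sh
(a sketch only, assuming the stock JVM_OPTS handling in the packaged script; adjust for your
install):

    # conf/cassandra-env.sh -- sketch, assuming the packaged layout; adjust as needed.
    # Allow CMS to unload classes (and collect the short-lived classloaders that the
    # JMX/RMI machinery creates) during concurrent collections.
    JVM_OPTS="$JVM_OPTS -XX:+CMSClassUnloadingEnabled"

After a rolling restart you can confirm the flag took effect by checking the Cassandra process
command line (for example with ps).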

> Cassandra read latency potentially caused by memory leak
> --------------------------------------------------------
>
>                 Key: CASSANDRA-10150
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10150
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: cassandra 2.0.12
>            Reporter: Cheng Ren
>
> We are currently migrating to a new Cassandra cluster that is multi-region on EC2. Our
> previous cluster was also on EC2, but only in the east region. In addition, we have upgraded
> from Cassandra 2.0.4 to 2.0.12 and from Ubuntu 12 to Ubuntu 14.
> We are investigating a Cassandra latency problem on our new cluster. The symptom is that
> over a long period of time (12-16 hours) the TP90-95 read latency degrades to the point of
> being well above our SLAs. During normal operation our TP95 for a 50-key lookup is 75ms;
> when fully degraded, we are facing 300ms TP95 latencies. Doing a rolling restart resolves
> the problem.
> We are noticing a strong correlation between Old Gen heap usage (and how much of it gets
> freed) and the high latencies. We are running with a max heap size of 12GB and a max new-gen
> size of 2GB.
> Below is a chart of heap usage over a 24-hour period. Right below it is a chart of TP95
> latencies (for a mixed workload of 50-key and single-key lookups), and the third image shows
> CMS Old Gen memory usage:
> Overall heap usage over 24 hrs:
> !https://dl.dropboxusercontent.com/u/303980955/1.png|height=300,width=500!
> TP95 latencies over 24 hours:
> !https://dl.dropboxusercontent.com/u/303980955/2.png|height=300,width=500!
> OldGen memory usage over 24 hours:
> !https://dl.dropboxusercontent.com/u/303980955/3.png|height=300,width=500!
> You can see from this that the Old Gen section of the heap is what is using up the majority
> of the heap space. We cannot figure out why this memory is not being collected during a full
> GC. For reference, in our old Cassandra cluster a full GC clears up the majority of the heap
> space. See the image below, from an old production node operating normally:
> !https://dl.dropboxusercontent.com/u/303980955/4.png|height=300,width=500!
> From a heap dump we found that most of the memory is consumed by unreachable objects. With
> further analysis we were able to see that those objects are
> RMIConnectionImpl$CombinedClassLoader$ClassLoaderWrapper (holding 4GB of memory) and
> java.security.ProtectionDomain (holding 2GB). The only place we know Cassandra uses RMI is
> in JMX, but does anyone have any clue where else these objects are used, and why they take
> up so much memory?
> It would also be great if someone could offer further debugging tips on the latency or GC
> issue.
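
For anyone who wants to reproduce the heap-dump analysis described in the report above, the
standard JDK tools are enough; a minimal sketch (not taken from the original report; replace
<pid> with the Cassandra JVM's process id):

    # Class histogram: per-class instance counts and shallow sizes, including
    # unreachable objects (adding ":live" would force a full GC first and hide them).
    jmap -histo <pid> | head -n 40

    # Full heap dump for offline analysis (e.g. in Eclipse MAT). Omitting "live,"
    # keeps unreachable objects in the dump, which matters here since most of the
    # retained memory is unreachable.
    jmap -dump:format=b,file=/tmp/cassandra-heap.hprof <pid>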



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
