cassandra-commits mailing list archives

From "Cheng Ren (JIRA)" <>
Subject [jira] [Created] (CASSANDRA-10150) Cassandra read latency potentially caused by memory leak
Date Thu, 20 Aug 2015 23:18:45 GMT
Cheng Ren created CASSANDRA-10150:

             Summary: Cassandra read latency potentially caused by memory leak
                 Key: CASSANDRA-10150
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: cassandra 2.0.12
            Reporter: Cheng Ren

  We are currently migrating to a new Cassandra cluster which is multi-region on EC2.  Our
previous cluster was also on EC2 but only in the east region.  In addition we have upgraded
from Cassandra 2.0.4 to 2.0.12 and from Ubuntu 12 to Ubuntu 14.

  We are investigating a Cassandra latency problem on our new cluster.  The symptom is that
over a long period of time (12-16 hours) the TP90-95 read latency degrades to the point of
being well above our SLAs.  During normal operation our TP95 for a 50-key lookup is 75ms;
when fully degraded, we are seeing 300ms TP95 latencies.  Doing a rolling restart resolves
the problem.

We are noticing a high correlation between the Old Gen heap usage (and how much is freed up)
and the high latencies.  We are running with a max heap size of 12GB and a max new-gen size
of 2GB.

Below are three charts covering the same 24-hour period: overall heap usage, TP95 latencies
(a mixed workload of 50-key and single-key lookups), and CMS Old Gen memory usage:
Overall heap usage over 24 hrs:
TP95 latencies over 24 hours:
OldGen memory usage over 24 hours:
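A quick way to cross-check the chart data is to read per-pool usage straight from the JVM's
standard management API. Below is a minimal sketch (the class name and output format are our
own; pool names depend on the collector in use, and under CMS the old generation reports as
"CMS Old Gen"):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PoolUsage {
    // Print used/max megabytes for every memory pool the JVM exposes.
    // Under CMS the old generation pool is named "CMS Old Gen"; getMax()
    // can legitimately return -1 if the pool has no defined maximum.
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            long maxMb = u.getMax() < 0 ? -1 : u.getMax() >> 20;
            System.out.printf("%s: used=%d MB, max=%d MB%n",
                    pool.getName(), u.getUsed() >> 20, maxMb);
        }
    }
}
```

Polling this once a minute (e.g. from a JMX scraper) reproduces the OldGen curve above
without needing the full metrics pipeline.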

 You can see from this that the old gen section of our heap is what is using up the majority
of the heap space.  We cannot figure out why this memory is not being collected during a full
GC.  For reference, in our old Cassandra cluster the behavior is that a full GC will clear
up the majority of the heap space.  See the image below from an old production node operating
normally:


From the heap dump file we found that most of the memory is consumed by unreachable objects.
With further analysis we were able to see that those objects are
RMIConnectionImpl$CombinedClassLoader$ClassLoaderWrapper instances (holding 4GB of memory)
and another class (holding 2GB).  The only place we know Cassandra uses RMI is JMX, but does
anyone have any clue as to where else those objects are used, and why they take up so much
memory?  It would also be great if someone could offer any further debugging tips on the
latency or GC behavior.
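For anyone wanting to reproduce the heap-dump analysis: dumps can be captured
programmatically through HotSpot's diagnostic MBean (a com.sun.management API, so
HotSpot-only). Passing live=false keeps unreachable objects in the dump, which is what made
the ClassLoaderWrapper instances visible; the class name and output path below are just
examples.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Write an .hprof heap dump to `path`. With liveOnly=false the dump
    // also includes unreachable objects, so leaked-but-uncollected
    // garbage stays visible to tools like Eclipse MAT.
    static void dump(String path, boolean liveOnly) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, liveOnly);  // fails if the file already exists
    }

    public static void main(String[] args) throws Exception {
        // Example path; point this at a disk with room for a 12GB heap.
        dump("/tmp/cassandra-heap.hprof", false);
    }
}
```

The same dump can be taken externally with `jmap -dump:format=b,file=<path> <pid>`; opening
it in a heap analyzer with unreachable-object inclusion enabled is what produced the
per-class figures above.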

This message was sent by Atlassian JIRA
