cassandra-user mailing list archives

From Robert Wille <rwi...@fold3.com>
Subject Help understanding aftermath of death by GC
Date Tue, 31 Mar 2015 12:22:59 GMT
I moved my site over to Cassandra a few months ago, and everything has been just peachy until
a few hours ago (yes, it would be in the middle of the night) when my entire cluster suffered
death by GC. By death by GC, I mean this:

[rwille@cas031 cassandra]$ grep GC system.log | head -5
 INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30219 ms for 1 collections, 7664429440 used; max is 8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:50:32,180 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30673 ms for 1 collections, 7707488712 used; max is 8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:51:05,108 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30453 ms for 1 collections, 7693634672 used; max is 8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:51:38,787 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30691 ms for 1 collections, 7686028472 used; max is 8329887744
 INFO [ScheduledTasks:1] 2015-03-31 02:52:12,452 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 30346 ms for 1 collections, 7701401200 used; max is 8329887744
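
To make the numbers in those lines easier to read, here is a minimal sketch that pulls the collection time and heap occupancy out of GCInspector output. It assumes the log format shown above; the names parse_gc_lines and GC_LINE are hypothetical, not part of Cassandra.

```python
import re

# Matches the GCInspector message shown above. \s+ after the colon lets the
# pattern work whether or not the log line was wrapped onto a second line.
GC_LINE = re.compile(
    r"GC for (\w+):\s+(\d+) ms for (\d+) collections, "
    r"(\d+) used; max is (\d+)"
)


def parse_gc_lines(text):
    """Return one dict per GCInspector entry found in text."""
    events = []
    for name, ms, count, used, max_heap in GC_LINE.findall(text):
        events.append({
            "collector": name,
            "gc_ms": int(ms),
            "collections": int(count),
            "heap_used_fraction": int(used) / int(max_heap),
        })
    return events


SAMPLE = (
    " INFO [ScheduledTasks:1] 2015-03-31 02:49:57,480 GCInspector.java "
    "(line 116) GC for ConcurrentMarkSweep:\n"
    "30219 ms for 1 collections, 7664429440 used; max is 8329887744"
)

if __name__ == "__main__":
    for event in parse_gc_lines(SAMPLE):
        print(event)
```

For the first entry above, this reports about 30 seconds of collection time with the heap roughly 92% full.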

I’m pretty sure I know what triggered it. When I first started developing against Cassandra,
I found the IN clause to be supremely useful, and I used it a lot. Later I figured out it
was a bad thing and repented and fixed my code, but I missed one spot. A maintenance task
spent a couple of hours repeatedly issuing queries with IN clauses with 1000 items in the
clause and the whole system went belly up.
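
For illustration, one common way to avoid a single 1000-item IN clause is to split the key list into small chunks and query each chunk separately. The following is only a sketch under assumptions: run_query is a hypothetical callable standing in for whatever driver call actually executes the query, and the chunk size of 20 is an arbitrary example value, not a recommendation from this thread.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(keys: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of keys, each with at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def fetch_in_chunks(
    keys: Sequence[str],
    run_query: Callable[[Sequence[str]], List[object]],
    size: int = 20,
) -> List[object]:
    """Call run_query once per small chunk instead of once with every key.

    run_query is hypothetical: it receives one chunk of keys and returns
    the rows for that chunk.
    """
    rows: List[object] = []
    for chunk in chunked(keys, size):
        rows.extend(run_query(chunk))
    return rows
```

With 1000 keys and a chunk size of 20, this issues 50 small queries rather than one large one, so no single request has to gather results for all 1000 keys at once.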

I get that my bad queries caused Cassandra to require more heap than was available, but here’s
what I don’t understand. When the crap hit the fan, the maintenance task died due to a timeout
error, but the cluster never recovered. I would have expected that once I was no longer issuing
the bad queries, the heap would get cleaned up and life would return to normal. Can anybody
help me understand why Cassandra wouldn’t recover? How is it that GC pressure can cause
the heap to become permanently uncollectable?

This makes me pretty worried. I can fix my code, but I don’t really have control over spikes.
If memory pressure spikes, I can tolerate some timeouts and errors, but if it can’t come
back when the pressure is gone, that seems pretty bad.

Any insights would be greatly appreciated.

Robert

