cassandra-commits mailing list archives

From "Gil Ganz (JIRA)" <>
Subject [jira] [Created] (CASSANDRA-8667) ConcurrentMarkSweep loop
Date Thu, 22 Jan 2015 09:25:34 GMT
Gil Ganz created CASSANDRA-8667:

             Summary: ConcurrentMarkSweep loop 
                 Key: CASSANDRA-8667
             Project: Cassandra
          Issue Type: Bug
          Components: Core
         Environment: dse 4.5.4 (cassandra, aws i2.2xlarge nodes

            Reporter: Gil Ganz
         Attachments:, cassandra.yaml

We are having an issue with nodes that, for some reason, get into a full GC loop and never recover.
It can happen on any node from time to time, but recently we have a node (added to
the cluster 2 days ago) that hits this every time.
The scenario is like this:
Almost no writes/reads are going to the cluster (<500 reads or writes per second). The node is up for
10-20 minutes, doing compactions of big column families, and then full GC starts to kick in,
doing loops of 60 sec CMS GC even though the heap is not full. Compaction becomes really
slow, and the node starts to look down to other nodes.

from system.log :

INFO [ScheduledTasks:1] 2015-01-21 23:02:29,552 (line 116) GC for ConcurrentMarkSweep:
36444 ms for 1 collections, 6933307656 used; max is 10317987840
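
To back up the "heap is not full" observation, the line above can be reduced to a pause length and a heap occupancy ratio. This is only a minimal sketch: the regular expression is written against this one line, and `parse_gc_line` and its field names are hypothetical helpers, not part of Cassandra.

```python
import re

# The GC line from system.log above, joined back onto a single line.
LINE = (
    "INFO [ScheduledTasks:1] 2015-01-21 23:02:29,552 (line 116) "
    "GC for ConcurrentMarkSweep: 36444 ms for 1 collections, "
    "6933307656 used; max is 10317987840"
)

# Pattern derived from the single sample line; other versions may log differently.
PATTERN = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections, "
    r"(?P<used>\d+) used; max is (?P<max>\d+)"
)


def parse_gc_line(line):
    """Return collector, pause and heap occupancy for one GC log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    used = int(match.group("used"))
    heap_max = int(match.group("max"))
    return {
        "collector": match.group("collector"),
        "pause_ms": int(match.group("ms")),
        "collections": int(match.group("count")),
        "heap_used_fraction": used / heap_max,
    }


result = parse_gc_line(LINE)
print(f"{result['collector']}: {result['pause_ms']} ms, "
      f"heap {result['heap_used_fraction']:.0%} used after the collection")
```

For this line the heap is roughly two thirds used after a 36 second collection, so the whole heap is indeed not full at that point.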

from gc.log.0:

2015-01-21T23:01:53.072-0800: 1541.643: [CMS2015-01-21T23:01:56.440-0800: 1545.011: [CMS-concurrent-mark:
13.914/13.951 secs] [Times: user=62.39 sys=7.05, real=13.95 secs]
 (concurrent mode failure)CMS: Large block 0x0000000000000000
: 6389749K->6389759K(6389760K), 36.1323980 secs] 10076149K->6685617K(10076160K), [CMS
Perm : 28719K->28719K(47840K)]After GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 0
Max   Chunk Size: 0
Number of Blocks: 0
Tree      Height: 0
After GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 24576
Max   Chunk Size: 24576
Number of Blocks: 1
Av.  Block  Size: 24576
Tree      Height: 1
, 36.1327700 secs] [Times: user=40.90 sys=0.00, real=36.14 secs]
Heap after GC invocations=236 (full 19):
 par new generation   total 3686400K, used 295857K [0x000000057ae00000, 0x0000000674e00000,
  eden space 3276800K,   9% used [0x000000057ae00000, 0x000000058ceec4c0, 0x0000000642e00000)
  from space 409600K,   0% used [0x000000065be00000, 0x000000065be00000, 0x0000000674e00000)
  to   space 409600K,   0% used [0x0000000642e00000, 0x0000000642e00000, 0x000000065be00000)
 concurrent mark-sweep generation total 6389760K, used 6389759K [0x0000000674e00000, 0x00000007fae00000,
 concurrent-mark-sweep perm gen total 48032K, used 28719K [0x00000007fae00000, 0x00000007fdce8000,
2015-01-21T23:02:29.204-0800: 1577.776: Total time for which application threads were stopped:
36.1334050 seconds
2015-01-21T23:02:29.239-0800: 1577.810: Total time for which application threads were stopped:
0.0060230 seconds
2015-01-21T23:02:29.239-0800: 1577.811: [GC [1 CMS-initial-mark: 6389759K(6389760K)] 6769792K(10076160K),
0.3112760 secs] [Times: user=0.00 sys=0.00, real=0.31 secs]
2015-01-21T23:02:29.551-0800: 1578.122: Total time for which application threads were stopped:
0.3118580 seconds
2015-01-21T23:02:29.551-0800: 1578.122: [CMS-concurrent-mark-start]
2015-01-21T23:02:29.635-0800: 1578.206: Total time for which application threads were stopped:
0.0060250 seconds
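
The "application threads were stopped" entries in the excerpt above show how long the node was frozen. As a rough sketch, they can be pulled out and summed as follows; the excerpt is copied from this message, wrapped lines included, and `stopped_pauses` is a hypothetical helper name.

```python
import re

# Safepoint lines copied from the gc.log excerpt above (wrapped as in the message).
EXCERPT = """\
2015-01-21T23:02:29.204-0800: 1577.776: Total time for which application threads were stopped:
36.1334050 seconds
2015-01-21T23:02:29.239-0800: 1577.810: Total time for which application threads were stopped:
0.0060230 seconds
2015-01-21T23:02:29.551-0800: 1578.122: Total time for which application threads were stopped:
0.3118580 seconds
2015-01-21T23:02:29.635-0800: 1578.206: Total time for which application threads were stopped:
0.0060250 seconds
"""

# \s+ lets the duration sit on the same line or on the following (wrapped) line.
STOPPED = re.compile(
    r"Total time for which application threads were stopped:\s+([\d.]+) seconds"
)


def stopped_pauses(text):
    """Return every reported stop-the-world pause, in seconds, in log order."""
    return [float(value) for value in STOPPED.findall(text)]


pauses = stopped_pauses(EXCERPT)
print(f"{len(pauses)} pauses, total {sum(pauses):.3f} s, longest {max(pauses):.3f} s")
```

Nearly all of the stopped time in this excerpt comes from the single pause that follows the concurrent mode failure.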

Machines are i2.2xlarge (8 cores, 60 GB RAM), with the data directory on ephemeral SSD. Heap size is 10 GB
with a 4 GB new generation (following a DSE recommendation made to solve another issue with many ParNew GCs).
The cluster has 2 DCs: 8 nodes in the west and 17 nodes in the east (the main DC). It is read heavy: 15k writes per
second and at least as many reads per second right now because of the problems, but reads were as high as
35k per second in the past.
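
The heap sizes quoted above can be cross-checked against the gc.log excerpt. The sketch below uses only figures that appear in the logs in this message. The variable names are illustrative, and the reading that the reported "par new generation total" counts a single survivor space is an assumption about how HotSpot reports capacity.

```python
# Sizes in KiB, copied from the "Heap after GC" section of the gc.log excerpt.
EDEN_K = 3276800
SURVIVOR_K = 409600                  # each of the "from" and "to" spaces
PAR_NEW_TOTAL_K = 3686400            # "par new generation total"
OLD_GEN_TOTAL_K = 6389760            # "concurrent mark-sweep generation total"
OLD_GEN_USED_K = 6389759             # old gen usage after the failed collection
REPORTED_HEAP_K = 10076160           # capacity in "10076149K->6685617K(10076160K)"
SYSTEM_LOG_MAX_BYTES = 10317987840   # "max is ..." from system.log

# Assumption: the reported new-gen capacity is eden plus one survivor space.
assert EDEN_K + SURVIVOR_K == PAR_NEW_TOTAL_K
assert PAR_NEW_TOTAL_K + OLD_GEN_TOTAL_K == REPORTED_HEAP_K
assert REPORTED_HEAP_K * 1024 == SYSTEM_LOG_MAX_BYTES

new_gen_k = EDEN_K + 2 * SURVIVOR_K          # both survivor spaces counted
full_heap_k = new_gen_k + OLD_GEN_TOTAL_K
old_gen_free_k = OLD_GEN_TOTAL_K - OLD_GEN_USED_K
old_gen_used_fraction = OLD_GEN_USED_K / OLD_GEN_TOTAL_K

print(f"new gen : {new_gen_k / 1024:.0f} MiB")
print(f"old gen : {OLD_GEN_TOTAL_K / 1024:.0f} MiB, {old_gen_free_k} KiB free after full GC")
print(f"heap    : {full_heap_k / 1024 / 1024:.0f} GiB")
```

Read this way, the figures are consistent with a 10 GiB heap and a new generation of about 4 GB, and they show the old generation essentially full after the collection even though the heap as a whole is not.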

Attached are the yaml and env files.

This message was sent by Atlassian JIRA
