Date: Mon, 24 Aug 2015 16:01:46 +0000 (UTC)
From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Resolved] (CASSANDRA-8667) ConcurrentMarkSweep loop

     [ https://issues.apache.org/jira/browse/CASSANDRA-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-8667.
---------------------------------------
       Resolution: Cannot Reproduce
    Fix Version/s:     (was: 2.0.x)

> ConcurrentMarkSweep loop
> -------------------------
>
>                 Key: CASSANDRA-8667
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8667
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE 4.5.4 (Cassandra 2.0.11.82), AWS i2.2xlarge nodes
>            Reporter: Gil Ganz
>         Attachments: cassandra-env.sh, cassandra.yaml
>
> Hey,
> We are having an issue with nodes that for some reason get into a full GC loop and never recover. It can happen on any node from time to time, but recently we have a node (which was added to the cluster 2 days ago) that gets into this state every time.
> The scenario is like this: almost no writes/reads are going to the cluster (<500 reads or writes per second); the node is up for 10-20 minutes, doing compactions of big column families, and then full GC starts to kick in, doing loops of 60-second CMS collections even though the heap is not full. Compaction becomes really slow and the node starts to appear down to the other nodes.
> From system.log:
> INFO [ScheduledTasks:1] 2015-01-21 23:02:29,552 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 36444 ms for 1 collections, 6933307656 used; max is 10317987840
> From gc.log.0:
> 2015-01-21T23:01:53.072-0800: 1541.643: [CMS2015-01-21T23:01:56.440-0800: 1545.011: [CMS-concurrent-mark: 13.914/13.951 secs] [Times: user=62.39 sys=7.05, real=13.95 secs]
> (concurrent mode failure)CMS: Large block 0x0000000000000000
> : 6389749K->6389759K(6389760K), 36.1323980 secs] 10076149K->6685617K(10076160K), [CMS Perm : 28719K->28719K(47840K)]After GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 0
> Max Chunk Size: 0
> Number of Blocks: 0
> Tree Height: 0
> After GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 24576
> Max Chunk Size: 24576
> Number of Blocks: 1
> Av. Block Size: 24576
> Tree Height: 1
> , 36.1327700 secs] [Times: user=40.90 sys=0.00, real=36.14 secs]
> Heap after GC invocations=236 (full 19):
>  par new generation total 3686400K, used 295857K [0x000000057ae00000, 0x0000000674e00000, 0x0000000674e00000)
>   eden space 3276800K, 9% used [0x000000057ae00000, 0x000000058ceec4c0, 0x0000000642e00000)
>   from space 409600K, 0% used [0x000000065be00000, 0x000000065be00000, 0x0000000674e00000)
>   to space 409600K, 0% used [0x0000000642e00000, 0x0000000642e00000, 0x000000065be00000)
>  concurrent mark-sweep generation total 6389760K, used 6389759K [0x0000000674e00000, 0x00000007fae00000, 0x00000007fae00000)
>  concurrent-mark-sweep perm gen total 48032K, used 28719K [0x00000007fae00000, 0x00000007fdce8000, 0x0000000800000000)
> }
> 2015-01-21T23:02:29.204-0800: 1577.776: Total time for which application threads were stopped: 36.1334050 seconds
> 2015-01-21T23:02:29.239-0800: 1577.810: Total time for which application threads were stopped: 0.0060230 seconds
> 2015-01-21T23:02:29.239-0800: 1577.811: [GC [1 CMS-initial-mark: 6389759K(6389760K)] 6769792K(10076160K), 0.3112760 secs] [Times: user=0.00 sys=0.00, real=0.31 secs]
> 2015-01-21T23:02:29.551-0800: 1578.122: Total time for which application threads were stopped: 0.3118580 seconds
> 2015-01-21T23:02:29.551-0800: 1578.122: [CMS-concurrent-mark-start]
> 2015-01-21T23:02:29.635-0800: 1578.206: Total time for which application threads were stopped: 0.0060250 seconds
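The excerpt above names the failure mode itself: a concurrent mode failure, with the CMS old generation essentially at capacity (6389759K used of 6389760K) when the concurrent cycle cannot finish, so each cycle degenerates into a stop-the-world collection. A minimal way to confirm that the old generation really stays pinned between collections, assuming the JDK's jstat is available on the node and <pid> stands in for the Cassandra process id, is to leave something like this running while the loop is happening:

    # Poll HotSpot GC counters for the Cassandra JVM every 5 seconds.
    # <pid> is a placeholder; it can be found with, e.g., pgrep -f CassandraDaemon
    jstat -gcutil <pid> 5000

If the O (old generation occupancy) column sits near 100% while FGC keeps climbing, the old gen really is staying full rather than spiking momentarily, which matches the numbers above; in that case the usual knobs are the heap and new-gen sizes and the CMSInitiatingOccupancyFraction setting in cassandra-env.sh.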
> Machines are i2.2xlarge (8 cores, 60 GB RAM), the data dir is on SSD ephemeral storage, and the heap is 10 GB with a 4 GB new gen (following a DSE recommendation to solve another issue with many ParNew GCs going on).
> It is a 2-DC cluster, 8 nodes in the west and 17 nodes in the east (the main DC), read heavy (15k writes per second, at least that many reads per second right now due to the problems, but it was as high as 35k reads per second in the past).
> The cassandra.yaml and cassandra-env.sh files are attached.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)