Date: Mon, 24 Aug 2015 16:01:46 +0000 (UTC)
From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Resolved] (CASSANDRA-8667) ConcurrentMarkSweep loop

     [ https://issues.apache.org/jira/browse/CASSANDRA-8667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-8667.
---------------------------------------
       Resolution: Cannot Reproduce
    Fix Version/s:     (was: 2.0.x)

> ConcurrentMarkSweep loop
> -------------------------
>
>                 Key: CASSANDRA-8667
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8667
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE 4.5.4 (Cassandra 2.0.11.82), AWS i2.2xlarge nodes
>            Reporter: Gil Ganz
>         Attachments: cassandra-env.sh, cassandra.yaml
>
> Hey,
> We are having an issue with nodes that for some reason get into a full GC loop and never recover. It can happen on any node from time to time, but recently we have a node (which was added to the cluster 2 days ago) that gets into this state every time.
> The scenario is like this: almost no writes/reads are going to the cluster (<500 reads or writes per second); the node is up for 10-20 minutes, doing compactions of big column families, and then full GC starts to kick in, doing loops of 60-second CMS collections even though the heap is not full. Compaction becomes really slow and the node starts to appear down to the other nodes.
> From system.log:
> INFO [ScheduledTasks:1] 2015-01-21 23:02:29,552 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 36444 ms for 1 collections, 6933307656 used; max is 10317987840
> From gc.log.0:
> 2015-01-21T23:01:53.072-0800: 1541.643: [CMS2015-01-21T23:01:56.440-0800: 1545.011: [CMS-concurrent-mark: 13.914/13.951 secs] [Times: user=62.39 sys=7.05, real=13.95 secs]
> (concurrent mode failure)CMS: Large block 0x0000000000000000
> : 6389749K->6389759K(6389760K), 36.1323980 secs] 10076149K->6685617K(10076160K), [CMS Perm : 28719K->28719K(47840K)]After GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 0
> Max Chunk Size: 0
> Number of Blocks: 0
> Tree Height: 0
> After GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 24576
> Max Chunk Size: 24576
> Number of Blocks: 1
> Av. Block Size: 24576
> Tree Height: 1
> , 36.1327700 secs] [Times: user=40.90 sys=0.00, real=36.14 secs]
> Heap after GC invocations=236 (full 19):
>  par new generation total 3686400K, used 295857K [0x000000057ae00000, 0x0000000674e00000, 0x0000000674e00000)
>   eden space 3276800K, 9% used [0x000000057ae00000, 0x000000058ceec4c0, 0x0000000642e00000)
>   from space 409600K, 0% used [0x000000065be00000, 0x000000065be00000, 0x0000000674e00000)
>   to space 409600K, 0% used [0x0000000642e00000, 0x0000000642e00000, 0x000000065be00000)
>  concurrent mark-sweep generation total 6389760K, used 6389759K [0x0000000674e00000, 0x00000007fae00000, 0x00000007fae00000)
>  concurrent-mark-sweep perm gen total 48032K, used 28719K [0x00000007fae00000, 0x00000007fdce8000, 0x0000000800000000)
> }
> 2015-01-21T23:02:29.204-0800: 1577.776: Total time for which application threads were stopped: 36.1334050 seconds
> 2015-01-21T23:02:29.239-0800: 1577.810: Total time for which application threads were stopped: 0.0060230 seconds
> 2015-01-21T23:02:29.239-0800: 1577.811: [GC [1 CMS-initial-mark: 6389759K(6389760K)] 6769792K(10076160K), 0.3112760 secs] [Times: user=0.00 sys=0.00, real=0.31 secs]
> 2015-01-21T23:02:29.551-0800: 1578.122: Total time for which application threads were stopped: 0.3118580 seconds
> 2015-01-21T23:02:29.551-0800: 1578.122: [CMS-concurrent-mark-start]
> 2015-01-21T23:02:29.635-0800: 1578.206: Total time for which application threads were stopped: 0.0060250 seconds
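The excerpt above names the failure mode itself: a concurrent mode failure, with the CMS old generation essentially at capacity (6389759K used of 6389760K) when the concurrent cycle cannot finish, so each cycle degenerates into a stop-the-world collection. A minimal way to confirm that the old generation really stays pinned between collections, assuming the JDK's jstat is available on the node and <pid> stands in for the Cassandra process id, is to leave something like this running while the loop is happening:

    # Poll HotSpot GC counters for the Cassandra JVM every 5 seconds.
    # <pid> is a placeholder; it can be found with, e.g., pgrep -f CassandraDaemon
    jstat -gcutil <pid> 5000

If the O (old generation occupancy) column sits near 100% while FGC keeps climbing, the old gen really is staying full rather than spiking momentarily, which matches the numbers above; in that case the usual knobs are the heap and new-gen sizes and the CMSInitiatingOccupancyFraction setting in cassandra-env.sh.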
> Machines are i2.2xlarge (8 cores, 60 GB RAM), the data dir is on SSD ephemeral storage, and the heap is 10 GB with a 4 GB new gen (following a DSE recommendation to solve another issue with many ParNew GCs going on).
> It is a 2-DC cluster, 8 nodes in the west and 17 nodes in the east (the main DC), read heavy (15k writes per second, at least that many reads per second right now due to the problems, but it was as high as 35k reads per second in the past).
> The cassandra.yaml and cassandra-env.sh files are attached.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)