Date: Thu, 22 Oct 2015 13:08:27 +0000 (UTC)
From: "Robbie Strickland (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap after long GC pause

[ https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969125#comment-14969125 ]

Robbie Strickland edited comment on CASSANDRA-10449 at 10/22/15 1:08 PM:
-------------------------------------------------------------------------

I decided to try upgrading to 2.1.11 to see if the issue was resolved by CASSANDRA-9681.
The node has been joining for over 24 hours, even though it appears to have finished streaming after about 6 hours:

{noformat}
ubuntu@eventcass4x087:~$ nodetool netstats | grep -v 100%
Mode: JOINING
Bootstrap 7047c510-7732-11e5-a7e7-63f53bbd2778
        Receiving 171 files, 95313491312 bytes total. Already received 171 files, 95313491312 bytes total
        Receiving 165 files, 78860134041 bytes total. Already received 165 files, 78860134041 bytes total
        Receiving 158 files, 77709354374 bytes total. Already received 158 files, 77709354374 bytes total
        Receiving 184 files, 106710570690 bytes total. Already received 184 files, 106710570690 bytes total
        Receiving 136 files, 35699286217 bytes total. Already received 136 files, 35699286217 bytes total
        Receiving 169 files, 53498180215 bytes total. Already received 169 files, 53498180215 bytes total
        Receiving 197 files, 129020987979 bytes total. Already received 197 files, 129020987979 bytes total
        Receiving 196 files, 113904035360 bytes total. Already received 196 files, 113904035360 bytes total
        Receiving 172 files, 47685647028 bytes total. Already received 172 files, 47685647028 bytes total
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         1              0
Responses                       n/a         0       83743675
{noformat}

It doesn't appear to still be building indexes either:

{noformat}
ubuntu@eventcass4x087:~$ nodetool compactionstats
pending tasks: 2
   compaction type                keyspace      table   completed       total    unit   progress
        Compaction   prod_analytics_events   wuevents   163704673   201033961   bytes     81.43%
Active compaction remaining time :        n/a
{noformat}

So I'm not sure why it's still joining. Any thoughts?

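For anyone checking the same symptom, the per-peer summary lines in the netstats output above can be tallied mechanically instead of by eye. The following is a minimal Python sketch and not part of the original comment. The function name `streaming_summary` and the regular expression are illustrative assumptions based only on the line format quoted here, and other Cassandra versions may print these lines differently.

```python
import re

# Matches one "Receiving ... Already received ..." summary line,
# in the format quoted in the netstats output above (an assumption
# about the format; adjust if your version prints it differently).
SUMMARY = re.compile(
    r"Receiving (\d+) files, (\d+) bytes total\. "
    r"Already received (\d+) files, (\d+) bytes total"
)


def streaming_summary(netstats_text):
    """Return (sessions, incomplete, total_bytes) for receiving sessions."""
    sessions = 0
    incomplete = 0
    total_bytes = 0
    for match in SUMMARY.finditer(netstats_text):
        files, size, got_files, got_size = map(int, match.groups())
        sessions += 1
        total_bytes += size
        # A session is incomplete if either count falls short of its total.
        if (got_files, got_size) != (files, size):
            incomplete += 1
    return sessions, incomplete, total_bytes


# Two of the lines from the output above, used as sample input.
sample = (
    "Receiving 171 files, 95313491312 bytes total. "
    "Already received 171 files, 95313491312 bytes total\n"
    "Receiving 165 files, 78860134041 bytes total. "
    "Already received 165 files, 78860134041 bytes total\n"
)

sessions, incomplete, total_bytes = streaming_summary(sample)
print(sessions, "sessions,", incomplete, "incomplete,", total_bytes, "bytes expected")
```

If `incomplete` is zero while the node still reports `Mode: JOINING`, streaming itself is probably not what the node is waiting on, which matches the observation above.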
> OOM on bootstrap after long GC pause
> ------------------------------------
>
>                 Key: CASSANDRA-10449
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10449
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Ubuntu 14.04, AWS
>            Reporter: Robbie Strickland
>              Labels: gc
>             Fix For: 2.1.x
>
>         Attachments: GCpath.txt, heap_dump.png, system.log.10-05, thread_dump.log, threads.txt
>
>
> I have a 20-node cluster (i2.4xlarge) with vnodes (default of 256) and 500-700GB per node. SSTable counts are <10 per table. I am attempting to provision additional nodes, but bootstrapping OOMs every time after about 10 hours with a sudden long GC pause:
> {noformat}
> INFO [Service Thread] 2015-10-05 23:33:33,373 GCInspector.java:252 - G1 Old Generation GC in 1586126ms. G1 Old Gen: 49213756976 -> 49072277176;
> ...
> ERROR [MemtableFlushWriter:454] 2015-10-05 23:33:33,380 CassandraDaemon.java:223 - Exception in thread Thread[MemtableFlushWriter:454,5,main]
> java.lang.OutOfMemoryError: Java heap space
> {noformat}
> I have tried increasing max heap to 48G just to get through the bootstrap, to no avail.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)