Date: Mon, 25 Sep 2017 10:55:00 +0000 (UTC)
From: "Thomas Steinmaurer (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-13900) Massive GC suspension increase after updating to 3.0.14 from 2.1.18

     [ https://issues.apache.org/jira/browse/CASSANDRA-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Steinmaurer updated CASSANDRA-13900:
-------------------------------------------
    Description:
In short: After upgrading to 3.0.14 (from 2.1.18), we aren't able to process the same incoming write load on the same infrastructure anymore.

We have a loadtest environment running 24x7 that tests our software with Cassandra as the backend.
Both loadtest and production are hosted in AWS and have the same spec on the Cassandra side, namely:
* 9x m4.xlarge
* 8G heap
* CMS (400MB newgen)
* 2TB EBS gp2
per node.

We have had a solid/constant baseline in loadtest at ~60% cluster-average CPU, with constant, simulated load running against the cluster, using Cassandra 2.1 for more than 2 years now.

Recently we started to upgrade this 9-node loadtest environment to 3.0.14, and basically 3.0.14 isn't able to cope with the load anymore. No special tweaks, memory settings or other changes; everything is the same as with 2.1.18. We also haven't upgraded sstables yet, so the increase shown in the attached screenshot is not related to any manually triggered maintenance operation after the upgrade to 3.0.14.

According to our monitoring, with 3.0.14 we see a GC suspension time increase by a factor of more than 2, directly correlating with a CPU increase to more than 80%. See the attached screenshot "cassandra2118_vs_3014.jpg".

This all means that the incoming load 2.1.18 has been handling for several weeks now is something 3.0.14 can't handle, so we would need to either scale up (e.g. to m4.2xlarge) or scale out to be able to handle the same load.

  was:
In short: After upgrading to 3.0.14 (from 2.1.18), we aren't able to process the same incoming write load on the same infrastructure anymore.

We have a loadtest environment running 24x7 that tests our software with Cassandra as the backend. Both loadtest and production are hosted in AWS and have the same spec on the Cassandra side, namely:
* 9x m4.xlarge
* 8G heap
* CMS (400MB newgen)
* 2TB EBS gp2
per node.

We have had a solid/constant baseline in loadtest at ~60% cluster-average CPU, with constant, simulated load running against the cluster, using Cassandra 2.1 for more than 2 years now.

Recently we started to upgrade this 9-node loadtest environment to 3.0.14, and basically 3.0.14 isn't able to cope with the load anymore. No special tweaks, memory settings or other changes; everything is the same as with 2.1.18. We also haven't upgraded sstables yet, so the increase mentioned below is not related to any manually triggered maintenance operation after the upgrade to 3.0.14.

According to our monitoring, with 3.0.14 we see a GC suspension time increase by a factor of more than 2, directly correlating with a CPU increase to more than 80%. See the attached screenshot "cassandra2118_vs_3014.jpg".

This all means that the incoming load 2.1.18 has been handling for several weeks now is something 3.0.14 can't handle, so we would need to either scale up (e.g. to m4.2xlarge) or scale out to be able to handle the same load.


> Massive GC suspension increase after updating to 3.0.14 from 2.1.18
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-13900
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13900
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Thomas Steinmaurer
>            Priority: Blocker
>        Attachments: cassandra2118_vs_3014.jpg
>
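For reference, the heap sizing described above (8G heap with a 400MB CMS new generation) is the kind of configuration typically expressed through MAX_HEAP_SIZE and HEAP_NEWSIZE in cassandra-env.sh, and GC suspension numbers of the sort reported here can be cross-checked directly against the JVM GC log when GC logging runs with -XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime. The sketch below is illustrative only and not part of the reporter's setup; the log path and the per-minute bucketing are assumptions:

    #!/usr/bin/env python
    # Illustrative sketch: sum JVM stop-the-world time per minute from a GC log
    # written with -XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime
    # (both assumed to be enabled on the node being inspected).
    import re
    from collections import defaultdict

    GC_LOG = "/var/log/cassandra/gc.log"  # assumed location; adjust as needed
    STOPPED = re.compile(r"Total time for which application threads were stopped: "
                         r"([0-9.]+) seconds")
    MINUTE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})")  # date-stamped lines

    stopped_per_minute = defaultdict(float)
    with open(GC_LOG) as log:
        for line in log:
            m = STOPPED.search(line)
            if not m:
                continue
            ts = MINUTE.match(line)
            bucket = ts.group(1) if ts else "unknown"
            stopped_per_minute[bucket] += float(m.group(1))

    for bucket in sorted(stopped_per_minute):
        # Seconds of application suspension accumulated within that minute.
        print("%s  %6.2f s stopped" % (bucket, stopped_per_minute[bucket]))

Comparing the per-minute totals from a 2.1.18 node and a 3.0.14 node under the same simulated load should make the reported factor-of-two-plus suspension increase visible independently of the external monitoring tool.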
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org