From: Elliott Sims
Date: Mon, 11 Feb 2019 02:05:49 -0600
Subject: Re: High GC pauses leading to client seeing impact
To: user@cassandra.apache.org
I would strongly suggest you consider an upgrade to 3.11.x. I found it decreased space needed by about 30%, in addition to significantly lowering GC.

As a first step, though, why not just revert to CMS for now if that was working OK for you? Then you can convert one host for diagnosis/tuning so the cluster as a whole stays functional.
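If you do revert a host, it's just the GC section of cassandra-env.sh. A rough sketch of the stock 2.0-era CMS block (the 8G/2G heap sizes below are illustrative, not a recommendation for your hardware):

    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="2G"
    # Stock CMS settings from the 2.0-era cassandra-env.sh
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

Drop the G1 flags (-XX:+UseG1GC and the related tuning options) on that host at the same time, restart, and compare GC logs against an unchanged node.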

That's also a pretty old version of the JDK to be using G1. I would definitely upgrade that to 1.8u202 and see if the problem goes away.
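Since "Time to stop threads" is time-to-safepoint rather than collector work, it's also worth turning on HotSpot's safepoint diagnostics while you investigate. A sketch using standard JDK 8 flags (the 10-second timeout is just an example value):

    # Log every application stop, not only GC-caused ones
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    # Per-safepoint breakdown of spin/block/sync/cleanup time
    JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
    JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"
    # Name the threads that still haven't reached the safepoint after 10s
    JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout"
    JVM_OPTS="$JVM_OPTS -XX:SafepointTimeoutDelay=10000"

Multi-minute time-to-safepoint on a large-memory box is often swapping or page faults (make sure swap is off) or the GC log write blocking on a busy disk, rather than anything the collector itself is doing.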

On Sun, Feb 10, 2019, 10:22 PM Rajsekhar Mallick <raj.mallick14@gmail.com> wrote:
> Hello Team,
>
> I have a cluster of 17 nodes in production (8 and 9 nodes across 2 DCs).
> Cassandra version: 2.0.11
> Clients connect using Thrift over port 9160
> JDK version: 1.8.0_66
> GC used: G1GC (16GB heap)
> Other GC settings:
> MaxGCPauseMillis=200
> ParallelGCThreads=32
> ConcGCThreads=10
> InitiatingHeapOccupancyPercent=50
> Number of CPU cores per system: 40
> Memory size: 185 GB
> Reads/sec: 300/sec on each node
> Writes/sec: 300/sec on each node
> Compaction strategy used: size-tiered compaction strategy
>
> Identified issues in the cluster:
> 1. Disk space usage across all nodes in the cluster is 80%. We are
> currently working on adding more storage to each node.
> 2. There are 2 tables for which we keep seeing a large number of
> tombstones. For one table, read requests in the last 5 minutes saw 120
> tombstone cells as compared to 4 live cells. Tombstone warnings and
> errors about queries getting aborted are also seen.
>
> Current issues seen:
> 1. We keep seeing GC pauses of a few minutes at random across nodes in
> the cluster. GC pauses of 120 seconds, and even 770 seconds, are seen.
> 2. This leads to nodes stalling and clients seeing direct impact.
> 3. The GC pauses we see are not during any of the G1GC phases. The GC
> log message prints "Time to stop threads took 770 seconds". So it is
> not the garbage collector doing any work; rather, stopping the threads
> at a safepoint is taking that much time.
> 4. This issue surfaced recently after we changed from 8GB (CMS) to
> 16GB (G1GC) across all nodes in the cluster.
>
> Kindly help with the above issue. I am not able to tell whether the GC
> is wrongly tuned or whether this is something else.
>
> Thanks,
> Rajsekhar Mallick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
